Abstract

In recent years, a great many methods of learning from multi-view data by considering the
diversity of different views have been proposed. These views may be obtained from multiple
sources or different feature subsets. For example, a person can be identified by face, finger- 
print, signature or iris with information obtained from multiple sources, while an image can 
be represented by its color or texture features, which can be seen as different feature subsets 
of the image. In trying to organize and highlight similarities and differences between the 
variety of multi-view learning approaches, we review a number of representative multi-view
learning algorithms in different areas and classify them into three groups: 1) co-training,
2) multiple kernel learning, and 3) subspace learning. Notably, co-training style algorithms
train alternately to maximize the mutual agreement on two distinct views of the data;
multiple kernel learning algorithms exploit kernels that naturally correspond to different
views and combine kernels either linearly or non-linearly to improve learning performance;
and subspace learning algorithms aim to obtain a latent subspace shared by multiple views 
by assuming that the input views are generated from this latent subspace. Though there 
is significant variance in the approaches to integrating multiple views to improve learn- 
ing performance, they mainly exploit either the consensus principle or the complementary 
principle to ensure the success of multi-view learning. Since access to multiple views is the
foundation of multi-view learning, beyond studying how to learn a model from
multiple views, it is also valuable to study how to construct multiple views and how to
evaluate these views. Overall, by exploring the consistency and complementary properties
of different views, multi-view learning becomes more effective and more promising, and has
better generalization ability than single-view learning.

Keywords: Multi-view Learning, Survey, Machine Learning 



1. Introduction 



In most scientific data analytics problems in video surveillance, social computing, and envi- 
ronmental sciences, data are collected from diverse domains or obtained from various feature 
extractors and exhibit heterogeneous properties, because variables of each data example can 
be naturally partitioned into groups. Each variable group is referred to as a particular view, 
and the multiple views for a particular problem can take different forms, e.g. a) colour de- 
scriptor, local binary patterns, local shape descriptor, slow features and spatial temporal 
context captured by multiple cameras for person re-identification and global activity un- 
derstanding in sparse camera network, and b) words in documents, information describing 
documents (e.g. title, author and journal) and the co-citation network graph for scientific 
document management (see Figure 1).

Conventional machine learning algorithms, such as support vector machines, discrimi- 
nant analysis, kernel machines, and spectral clustering, concatenate all multiple views into 
one single view to adapt to the learning setting. However, this concatenation causes over- 
fitting in the case of a small size training sample and is not physically meaningful because 
each view has a specific statistical property. In contrast to single view learning, multi-view 
learning as a new paradigm introduces one function to model a particular view and jointly 
optimizes all the functions to exploit the redundant views of the same input data and 
improve the learning performance. Therefore, multi-view learning has been receiving in- 
creased attention and existing algorithms can be classified into three groups: 1) co-training, 
2) multiple kernel learning, and 3) subspace learning. 

Co-training (Blum and Mitchell, 1998) is one of the earliest schemes for multi-view
learning. It trains alternately to maximize the mutual agreement on two distinct views
of the unlabeled data. Many variants have since been developed. Nigam and Ghani (2000)
generalized expectation-maximization (EM) by assigning changeable probabilistic labels
to unlabeled data. Muslea et al. (2002a, 2003, 2006) combined active learning with
co-training and proposed robust semi-supervised learning algorithms. Yu et al. (2007, 2011)
developed a Bayesian undirected graphical model for co-training and a novel co-training
kernel for Gaussian process classifiers. Wang and Zhou (2010) treated co-training as the
combinative label propagation over two views and unified the graph- and disagreement-based
semi-supervised learning into one framework. Sindhwani et al. (2005) constructed
a data-dependent "co-regularization" norm. The resultant reproducing kernel associated
with a single RKHS simplified the theoretical analysis and extended the algorithmic scope
of co-regularization. Bickel and Scheffer (2004) and Kumar et al. (2010, 2011) advanced
co-training for data clustering and designed effective algorithms for multi-view data. The
success of co-training algorithms mainly relies on three assumptions: (a) sufficiency: each
view is sufficient for classification on its own, (b) compatibility: the target functions of both
views predict the same labels for co-occurring features with a high probability, and (c)
conditional independence: views are conditionally independent given the label. The conditional
independence assumption is critical, but it is usually too strong to satisfy in practice and
thus several weaker alternatives (Abney, 2002; Balcan et al., 2004; Wang and Zhou, 2007)
have been considered.
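
As an illustration of the alternating scheme described above, the following is a minimal sketch of a co-training loop, not the algorithm of any cited paper. It assumes two one-dimensional views, a nearest-centroid classifier as a stand-in base learner, and a distance margin as the confidence score; the names `fit_centroids`, `predict` and `co_train` are hypothetical.

```python
import random

def fit_centroids(xs, ys):
    """Return the per-class mean of a 1-D feature (a tiny stand-in classifier)."""
    means = {}
    for label in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(vals) / len(vals)
    return means

def predict(means, x):
    """Return (label, confidence): nearest centroid and its margin over the runner-up."""
    ranked = sorted(means, key=lambda c: abs(x - means[c]))
    margin = abs(x - means[ranked[1]]) - abs(x - means[ranked[0]])
    return ranked[0], margin

def co_train(view1, view2, labels, rounds=5, per_round=4):
    """labels[i] is 0/1 for labeled examples and None for unlabeled ones."""
    labels = list(labels)
    for _ in range(rounds):
        for own in (view1, view2):
            idx = [i for i, y in enumerate(labels) if y is not None]
            model = fit_centroids([own[i] for i in idx], [labels[i] for i in idx])
            pool = [i for i, y in enumerate(labels) if y is None]
            # Label the unlabeled examples this view is most confident about;
            # the classifier on the other view sees them at its next fit.
            ranked = sorted(pool, key=lambda i: -predict(model, own[i])[1])
            for i in ranked[:per_round]:
                labels[i] = predict(model, own[i])[0]
    return labels

# Toy data: each view is a noisy 1-D function of the true label.
rng = random.Random(0)
truth = [i % 2 for i in range(40)]
view1 = [4.0 * y + rng.gauss(0.0, 0.5) for y in truth]
view2 = [-4.0 * y + rng.gauss(0.0, 0.5) for y in truth]
initial = [truth[i] if i < 4 else None for i in range(40)]
final = co_train(view1, view2, initial)
accuracy = sum(p == t for p, t in zip(final, truth)) / len(truth)
```

In this toy setting each view is sufficient on its own and the views are conditionally independent given the label, so the sketch sits in the regime where the three assumptions hold.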

Multiple kernel learning (MKL) was originally developed to control the search space capacity
of possible kernel matrices to achieve good generalization but has been widely applied
to problems involving multi-view data. This is because kernels in MKL naturally correspond
to different views and combining kernels either linearly or non-linearly improves learning
performance. Lanckriet et al. (2002, 2004) formulated MKL as a semi-definite programming
problem. Bach et al. (2004) treated MKL as a second order cone program problem and
developed an SMO algorithm to efficiently obtain the optimal solution. Sonnenburg et al.
(2006a,b) developed an efficient semi-infinite linear program and made MKL applicable to
large scale problems. Rakotomamonjy et al. (2007, 2008) proposed simple MKL by exploring
an adaptive 2-norm regularization formulation. Szafranski et al. (2008, 2010), Xu et al.
(2010) and Subrahmanya and Shin (2010) constructed the connection between MKL and
group-LASSO to model group structure. Many generalization bounds have been obtained
to theoretically guarantee the performance of MKL. Lanckriet et al. (2004) showed that
given $k$ base kernels, the estimation error is bounded by $O(\sqrt{(k/\gamma^2)/n})$, where $\gamma$ is the margin
of the learned classifier and $n$ is the number of training examples. Ying and Campbell (2009) used the metric entropy integrals and
pseudo-dimension of a set of candidate kernels to estimate the empirical Rademacher chaos
complexity. The generalization bounds have a logarithmic dependency on $k$ for the family
of convex combinations of $k$ base kernels with the $L_1$ constraint. Assuming different views
to be uncorrelated, Kloft and Blanchard (2011) derived a tighter upper bound by the local
Rademacher complexities for the $L_p$-norm MKL. The cited survey (Gönen and Alpaydın,
2011) is believed to contain all the related references omitted here.
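
The linear combination of view-specific kernels mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the weights are fixed by hand rather than learned (learning them is what the MKL methods above do), and `linear_kernel`, `rbf_kernel` and `combine_kernels` are hypothetical helper names. The sketch shows that a convex combination of valid kernel matrices from two views is again symmetric and positive semi-definite.

```python
import numpy as np

def linear_kernel(X):
    """Gram matrix of the linear kernel."""
    return X @ X.T

def rbf_kernel(X, gamma=0.5):
    """Gram matrix of the Gaussian kernel exp(-gamma * ||x - x'||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def combine_kernels(kernels, weights):
    """Convex combination of Gram matrices (weights normalized to sum to one)."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("kernel weights must be non-negative")
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))

# Two views of the same six examples, with different dimensionalities.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(6, 3))
view_b = rng.normal(size=(6, 5))
K = combine_kernels([linear_kernel(view_a), rbf_kernel(view_b)], [1.0, 3.0])
min_eig = np.linalg.eigvalsh(K).min()
```

The combined matrix `K` can be passed to any kernel method that accepts a precomputed Gram matrix.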



Subspace learning-based approaches aim to obtain a latent subspace shared by multiple 
views by assuming that the input views are generated from this latent subspace. The dimen- 
sionality of the latent subspace is lower than that of any input view, so subspace learning is 
effective in reducing the "curse of dimensionality" . Given this subspace, it is straightforward 
to conduct the subsequent tasks, such as classification and clustering. Canonical correlation
analysis (CCA) (Hotelling, 1936) and kernel canonical correlation analysis (KCCA) (Akaho,
2006) explore basis vectors for two sets of variables by mutually maximizing the correlations
between the projections onto these basis vectors, so it is straightforward to apply them to
two-view data to select the shared latent subspace. They have been further developed to
conduct multi-view clustering (Chaudhuri et al., 2009) and regression (Kakade and Foster,
2007). Diethe et al. (2008) generalized Fisher's discriminant analysis to explore the latent
subspace spanned by multi-view data. In contrast to CCA, this generalization considers
the class label information. Quadrianto and Lampert (2011) and Zhai et al. (2012) studied
multi-view metric learning by constructing embedding projections from multi-view data to
a shared subspace, where the Euclidean distance is meaningful across different views. The
latent subspace is valuable for inferring another view from the observation view. Shon et al.
(2006) exploited Gaussian processes, Sigal et al. (2009) maximized the mutual information,
and Chen et al. (2010) used a Markov network to construct the connections between the two
views through latent subspaces. Salzmann et al. (2010) and Jia et al. (2010) proposed to
find a latent subspace in which the information is correctly factorized into shared and private
parts across different views. Consistency and finite sample analysis (Fukumizu et al.,
2007; Hardoon and Shawe-Taylor, 2009; Cai and Sun, 2011) have been studied for KCCA.
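
To make the CCA step concrete, here is a minimal linear CCA sketch for two views. It is one standard way to compute canonical directions (whitening each view with a Cholesky factor and taking an SVD of the whitened cross-covariance), not necessarily the procedure used in the cited works. The small ridge term `reg` and the function name `cca` are assumptions added for numerical stability and illustration.

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """Return projections A, B and canonical correlations s for views X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Lx = np.linalg.cholesky(Cxx)          # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)          # Cyy = Ly Ly^T
    # Whitened cross-covariance: Lx^{-1} Cxy Ly^{-T}
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U)          # columns are directions for X
    B = np.linalg.solve(Ly.T, Vt.T)       # columns are directions for Y
    return A, B, s

# Two views generated from one shared latent variable z plus small noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ np.array([[1.0, -2.0, 0.5]]) + 0.1 * rng.normal(size=(500, 3))
Y = z @ np.array([[2.0, 1.0]]) + 0.1 * rng.normal(size=(500, 2))
A, B, s = cca(X, Y)
top_corr = np.corrcoef(X @ A[:, 0], Y @ B[:, 0])[0, 1]
```

Because both views are driven by the same latent variable, the leading canonical correlation is close to one and the first pair of projections recovers the shared one-dimensional subspace.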



Figure 1: Multi-view data: a) a web document can be represented by its URL and words on
the page, b) a web image can be depicted by its surrounding text separate from the
visual information, c) images of a 3D object taken from different viewpoints, d)
video clips are combinations of audio signals and visual frames, e) multilingual
documents have one view in each language.

In reviewing the literature on multi-view learning, we find it is tightly connected with
other topics in machine learning, such as active learning, ensemble learning and domain
adaptation. Active learning (Settles, 2009; Seung et al., 1992), sometimes called query
learning, aims to minimize the amount of labeled data required for learning a concept of
interest. Muslea et al. (2000) introduced co-testing, which is a novel approach to conducting
active learning with multiple views. They combined co-testing with co-EM and derived
a novel method co-EMT (Muslea et al., 2002a), which uses co-EM to generate accurate
classifiers and chooses the most informative unlabeled examples for co-testing to label.
Furthermore, considering strong and weak views, Muslea et al. (2003, 2006) advanced
co-testing by assuming that the concentrated examples whose labels from strong classifiers are
different and inconsistent with the prediction of the weak classifier provide more information
for labeling. The idea of ensemble learning (Dietterich, 2002; Lappalainen and Miskin,
2000) is to employ multiple learners and combine their predictions. The bagging algorithm
(Breiman, 1996) uses different training datasets to construct each member of the ensemble
and predicts through uniform averaging or voting over class labels. In contrast to
co-training, which ensures the diversities of the learned models by training on distinct views,
bagging requires different training data sets for generating models with different judgments.
AdaBoost (Freund and Schapire, 1996) is another well-known ensemble learning algorithm,
in which the principal idea is to train a new model to compensate for the errors made by earlier
models. In each round, the misclassified examples are identified and their emphasis is
increased in a new training set for the next training process. Both co-training and AdaBoost
rely on a growing ensemble of classifiers trained on resamples of the data; however,
AdaBoost tries to find the error-labeled examples, whereas co-training attempts to exploit
the agreement of the learners. Co-training is confidence-driven whereas AdaBoost is error-driven.
Domain adaptation refers to the problem of adapting a prediction model trained
on data from a source domain to a different target domain, where the data distributions
in the two domains are different. Many domain adaptation techniques (Wei and Pal, 2010;
Wan et al., 2011) have been proposed to solve the cross-language text classification problem
where the source domain includes the documents translated from the source language and
the target domain includes the original documents in the target language. Moreover, these
documents in different languages can be seen as different views of the original document;
thus, methods like co-training (Wan, 2009), multi-view majority voting (Amini et al., 2010)
and multi-view co-classification (Amini and Goutte, 2010) have been designed and successfully
applied to this problem.

In this survey paper, we provide a comprehensive overview of multi-view learning. The
rest of this paper is organized as follows: we first illustrate the principles underlying
multi-view learning algorithms in Section 2. In Section 3, different approaches to the construction
of multiple views and methods to evaluate these views are introduced. We present different
ways to combine multiple views in Section 4 and illustrate different kinds of multi-view
learning algorithms in detail in Sections 5, 6 and 7. The applications of multi-view learning
are introduced in Section 8 and experimental results reflecting the performance of multi-view
learning are shown in Section 9. Finally, we conclude the paper in Section 10.

2. Principles for Multi-view Learning 

The demand for redundant views of the same input data is a major difference between 
multi-view and single-view learning algorithms. Thanks to these multiple views, the learning
task can be conducted with abundant information. However, if the learning method is
unable to cope appropriately with multiple views, these views may even degrade the perfor- 
mance of multi-view learning. Through fully considering the relationships between multiple 
views, several successful multi-view learning techniques have been proposed. We analyze 
these various algorithms and observe that there are two significant principles ensuring their 
success: consensus and complementary principles. 

2.1 Consensus Principle 

The consensus principle aims to maximize the agreement on multiple distinct views. Suppose
the available data set $X$ has two views $X^1$ and $X^2$. An example $(x_i, y_i)$ is therefore viewed
as $(x_i^1, x_i^2, y_i)$, where $y_i$ is the label associated with the example. Dasgupta et al. (2002)
demonstrated the connection between the consensus of two hypotheses on two views
respectively and their error rates. Under some mild assumptions they gave the inequality

$$P(f^1 \neq f^2) \ge \max\{P_{err}(f^1), P_{err}(f^2)\}.$$

From the inequality, we conclude that the probability of a disagreement of two independent
hypotheses upper bounds the error rate of either hypothesis. Thus by minimizing the
disagreement rate of the two hypotheses, the error rate of each hypothesis will be minimized.



In recent years, a great number of methods appear to have utilized this consensus
principle in one way or another, even though in many cases the contributors are not aware 
of the relationship between their methods and this common underlying principle. For 
example, the co-training algorithm trains alternately to maximize the mutual agreement on 
two distinct views of the unlabeled data. By minimizing the error on labeled examples and 
maximizing the agreement on unlabeled examples, the co-training algorithm finally achieves 
one accurate classifier on each view. In the co-regularization algorithm, the consensus 
principle can be formulated by regularization terms as 

$$\min \sum_{i \in U} \left[ f^1(x_i) - f^2(x_i) \right]^2 + \sum_{i \in L} V(y_i, f(x_i)), \qquad (1)$$

where the first term enforces the agreement on two distinct views on unlabeled examples,
and the second term evaluates the empirical loss on the labeled examples with respect to a
loss function $V(\cdot, \cdot)$. By additionally considering the complexity of the hypotheses, we will
achieve the complete objective function, and solving it will result in learning two optimal 
hypotheses. Observing that applying kernel canonical correlation analysis (KCCA) to
the two feature spaces can improve the performance of the classifier, Farquhar et al. (2005)
proposed a supervised learning algorithm called SVM-2K, which combines the idea of KCCA
with SVM. An SVM can be thought of as projecting the feature to a 1-dimensional space
followed by thresholding, and SVM-2K enforces the consensus of the two
views on this 1-dimensional space. Formally this constraint can be written as

$$\left| \langle w_A, \phi_A(x_i) \rangle + b_A - \langle w_B, \phi_B(x_i) \rangle - b_B \right| \le \eta_i + \epsilon,$$

where $w_A, b_A$ ($w_B, b_B$) are the weight vector and threshold of the SVM on the first (second)
view, $\eta_i$ is a variable that imposes consensus between the two views, and $\epsilon$ is a slack variable.
In multi-view embedding, we conduct the embedding for multiple features simultaneously
while considering the consistency and complementarity of different views. For example, the
multi-view spectral embedding (Xia et al., 2010) first builds a patch for a sample on each
view, in which an arbitrary point and its $k$ nearest neighbors are forced to have similar outputs
in the low-dimensional embedding space. Following this local consensus optimization,
all the patches from different views are unified as a whole by global coordinate alignment. 
This can be seen as a global consensus optimization. 
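
The co-regularization objective in Equation (1) can be evaluated numerically as in the sketch below. Two points are assumptions rather than statements from the text: the labeled loss is summed over both view-specific functions (Equation (1) writes it with a generic $f$), and the squared loss is used for $V$. The function names and the toy hypotheses are hypothetical.

```python
def co_regularization_objective(f1, f2, unlabeled, labeled, loss):
    """Disagreement on unlabeled pairs plus empirical loss on labeled triples."""
    # First term of Eq. (1): squared disagreement of the two view functions.
    agreement = sum((f1(x1) - f2(x2)) ** 2 for x1, x2 in unlabeled)
    # Second term of Eq. (1): empirical loss, here summed over both views.
    empirical = sum(loss(y, f1(x1)) + loss(y, f2(x2)) for x1, x2, y in labeled)
    return agreement + empirical

def squared_loss(y, prediction):
    return (y - prediction) ** 2

# Toy hypotheses on two scalar views.
f1 = lambda x: 2.0 * x
f2 = lambda x: x + 1.0

unlabeled = [(1.0, 1.0), (2.0, 2.0)]            # (view-1 feature, view-2 feature)
labeled = [(1.0, 1.0, 2.0), (0.0, 0.0, 1.0)]    # (view-1, view-2, label)
value = co_regularization_objective(f1, f2, unlabeled, labeled, squared_loss)
```

Here the two hypotheses disagree on the second unlabeled example and `f1` misses the second labeled example, so each term contributes one unit to the objective.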

2.2 Complementary Principle 

The complementary principle states that in a multi-view setting, each view of the data 
may contain some knowledge that other views do not have; therefore, multiple views can 
be employed to comprehensively and accurately describe the data. In machine learning 
problems involving multi-view data, the complementary information underlying multiple 
views can be exploited to improve the learning performance by utilizing the complementary
principle.

Nigam and Ghani (2000) used the classifier learned on one view to label the unlabeled
data, and then prepared these newly labeled examples for the next iteration of classifier
training on another view. On the unlabeled data set, the models on two views therefore
shared the complementary information with each other, which led to an improvement in the
learning performance. Wang and Zhou (2007) studied why co-training style algorithms can
succeed when there are no redundant views. They used different configurations of the same
base learner, which can be seen as another kind of view, to describe the data in different
ways, and showed that when the diversity between the two learners is greater than
the amount of errors, the performance of the learners can be improved by co-training style
algorithms. The two classifiers which have different biases will label some examples with
different labels. If the examples labeled by the classifier $h_1$ on one view are to be useful for
the classifier $h_2$ on the other view, $h_1$ should contain some information that $h_2$ does not
know. The two classifiers will thus exchange complementary information with each other
and learn from each other under the complementary principle. As the co-training process
proceeds, the two classifiers will become increasingly similar, until the performance cannot
be further improved.

In multiple kernel learning, different kernels may correspond to different notions of
similarity. Since different ways of measuring the similarity of the data have specific advantages,
we resort to a learning method that makes an appropriate combination under the
complementary principle rather than trying to establish which kernel works best. Thus all
kinds of notions of similarity will work together to achieve an accurate evaluation of the
data. In addition, different kernels can also use inputs from a variety of views, possibly from
alternative sources or modalities. Thus by considering the complementary information underlying
various views of the data and combining multiple kernels from these distinct views,
a comprehensive measurement of the similarity can be obtained.

One traditional solution for the multi-view problem is to concatenate vectors from dif- 
ferent views into a new vector and then apply single-view learning algorithms straightfor- 
wardly on the concatenated vector. However, this concatenation causes over-fitting on a 
small training sample, and the specific statistical property of each view is ignored. For 
many applications with long feature vectors on more than one view as input, it is there- 
fore reasonable to construct a shared low-dimensional representation for these views. In 
human pose inference, the image features and 3D poses can be seen as two complementary
views that describe human poses. Several methods (Shon et al., 2006; Sigal et al., 2009)
have been designed to tackle this problem by constructing a latent subspace shared by
multiple views, in which distinct views are connected with one another,
integrating the complementary information underlying different views. At inference, given
a new observation on one view, it is possible to find the corresponding latent embedding
which is also connected with the point on the other view. Xia et al. (2010) developed a
new spectral embedding algorithm, namely, multi-view spectral embedding (MSE), which
encodes multi-view features to achieve a physically meaningful embedding. Yu et al. (2012b)
proposed a semi-supervised multi-view distance metric learning (SSM-DML) for cartoon
character retrieval. Since various low-level features can be extracted to represent the im- 
age, each feature space will give one measurement of similarity of the data, so it is difficult 
to decide which measurement is the most suitable. By considering the complementary in- 
formation underlying distinct views, advantage can be taken of metric learning to construct 
a shared latent subspace to precisely measure the dissimilarity between different examples. 

Both complementary and consensus principles play important roles in multi-view learning.
For example, in co-training style algorithms, Dasgupta et al. (2002) have shown that
by minimizing the disagreement rate of the two hypotheses on two views respectively, the
error rate of each hypothesis can be minimized. On the other hand, Wang and Zhou (2007)
established that the reason for the success of co-training style algorithms is the extent of the
diversity between the two learners; in other words, it is the complementary information in 
distinct views that influences the performance of co-training style algorithms. In address- 
ing the problem of multi-view learning, both the consensus and complementary principles 
should be kept in mind to take full advantage of multiple views. 

3. View Generation 

The priority for multi-view learning is the acquisition of redundant views, which is also 
a major difference from single-view learning. Multiple view generation not only aims to 
obtain the views of different attributes, but also involves the problem of ensuring that the 
views sufficiently represent the data and satisfy the assumptions required for learning. In 
this section, we will illustrate how to construct multiple views and how to evaluate these 
views. 

3.1 View Construction 

In practice, objects can frequently be described from different points of view. One classic 
multi-view example is the web classification problem. Usually, a web document can be 
described by either the words occurring on the page or the words contained in the anchor 
text of links pointing to this page. In many cases, however, no natural multiple views are 
available because of certain limitations, so that only one view may be provided to represent 
the data. Since it is difficult to straightforwardly conduct multi-view learning on this single 
view, the preliminary work of multi-view learning concerns the construction of multiple 
views from this single view. 

Generating different views corresponds to feature set partitioning, which generalizes the 
task of feature selection. Instead of providing a single representative set of features, feature 
set partitioning decomposes the original set of features into multiple disjoint subsets to con- 
struct each view. A simple way to convert from a single view to multiple views is to split 
the original feature set into different views at random, and there are indeed a number of
experiments in multi-view learning employing this trick (Brefeld et al., 2005; Bickel and Scheffer,
2004; Brefeld and Scheffer, 2004). However, there is no guarantee that a satisfactory result
will be obtained using this approach. Therefore, subsetting the feature set in a way that
adheres to the multi-view learning paradigm is not a trivial task, and is dependent on both 
the chosen learner and the data domain. 

The random subspace method (RSM) (Ho, 1998), as an example of a random sampling
algorithm, incorporates the benefits of bootstrapping and aggregation. Unlike bagging, which
bootstraps training samples, RSM performs the bootstrapping in the feature space. This
method relies on an autonomous, pseudo-random procedure to select a small number of
dimensions from a given feature space. In each pass, this selection is made and a subspace is
fixed by giving all points a constant value (zero) in the unselected dimensions. For a
given feature space of $n$ dimensions, there are $2^n$ such selections that can be constructed.
All the subspaces can then be regarded as different views of the data. While most other
methods suffer from the curse of dimensionality, this method takes advantage of the high
dimensionality. Tao et al. (2006) employed the random subspace method to sample several
small sets of features to reduce the discrepancy between the training data size and the
feature vector length. Based on the sampled subspaces, multiple SVMs can be constructed
and then be combined to obtain a more powerful classifier to solve the over-fitting problem.
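
The selection step of the random subspace method described above can be sketched as follows: each pass draws a random subset of dimensions and sets all other dimensions to zero for every point. The function names and the toy data are hypothetical, and the sketch covers only view construction, not the training or combination of the classifiers.

```python
import random

def random_subspace_masks(n_features, n_views, n_selected, seed=0):
    """Draw one boolean mask per view; True marks a selected dimension."""
    rng = random.Random(seed)
    masks = []
    for _ in range(n_views):
        chosen = set(rng.sample(range(n_features), n_selected))
        masks.append([j in chosen for j in range(n_features)])
    return masks

def apply_mask(X, mask):
    """Fix the subspace by giving every point the value zero in unselected dimensions."""
    return [[v if keep else 0.0 for v, keep in zip(row, mask)] for row in X]

X = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]]
masks = random_subspace_masks(n_features=6, n_views=3, n_selected=2)
views = [apply_mask(X, m) for m in masks]
```

The same mask is applied to all points within a view, so every view is a consistent subspace of the original feature space.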

Di and Crawford (2012) conducted a thorough investigation of view generation for hyperspectral
image data. Considering the key issues: diversity, compatibility and accuracy,
several strategies have been proposed to construct multiple views for hyperspectral data, 
as follows. 1) Clustering: these methods involve feature aggregation based on similarity 
metrics, with the goal of promoting diversity between views. 2) Random selection: in con- 
junction with feature space bagging, random selection can result in greater information 
exploration from the spectral feature space and can eliminate the impact of generating 
uninformative or corrupted views. 3) Uniform band slicing: uniform division of the data 
across the full spectral range creates views that contain bands separated by equal intervals, 
thus guaranteeing view sufficiency. The authors also proposed that increasing the number 
of views to increase diversity, or increasing randomness from the feature space to avoid 
insufficient or noisy views, further improves performance. 



With respect to learning problems involving textual documents, Matsubara et al. (2005)
proposed a pre-processing approach to easily construct different views required by multi- 
view learning algorithms. By identifying terms as bag-of-words and using different numbers 
of words to constitute each term, different representations of one document for different 
views can be obtained. This is a simple yet effective approach to the construction of multiple
views for textual documents, although it is difficult to apply to other domains. Wang et al.
(2011) developed a novel technique to reshape the original vector representation of a single
view into multiple matrix representations. For instance, a vector $x = [a, b, c, d, e, f]$ can
be reshaped into two different matrices:

$$\begin{pmatrix} a & c & e \\ b & d & f \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}.$$

Different ways of reshaping the vector induce multiple matrix patterns with a variety of
row and column dimensions. These matrices can be regarded as multiple independent
or weakly correlated views of the input data. Utilizing the matrix representation,
the required memory can be saved, new implicit information is introduced through the
new constraint in the structure, and the performance of the learned classifiers will be
improved compared to the vector representation.
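
The two reshapings in the example above correspond to filling a matrix column by column and row by row. A minimal sketch, with hypothetical helper names, reproduces them for the vector $[a, b, c, d, e, f]$:

```python
def reshape_by_columns(x, rows):
    """Fill a matrix with `rows` rows column by column."""
    return [x[r::rows] for r in range(rows)]

def reshape_by_rows(x, cols):
    """Fill a matrix with `cols` columns row by row."""
    return [x[i:i + cols] for i in range(0, len(x), cols)]

x = list("abcdef")
view_1 = reshape_by_columns(x, rows=2)   # [[a, c, e], [b, d, f]]
view_2 = reshape_by_rows(x, cols=3)      # [[a, b, c], [d, e, f]]
```

Other row and column sizes (for example 3 by 2) give further matrix views of the same vector.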



Chen et al. (2011) suggested a novel feature decomposition algorithm called Pseudo
Multi-view Co-training (PMC) to automatically divide the features of a single view dataset
into two mutually exclusive subsets. Considering the linear classifier $f(x) = w^\top x + b$ given
the weight vector $w$, the optimization can be written as

$$\min_{w_1, w_2} \log\left(e^{\mathcal{L}(w_1; L)} + e^{\mathcal{L}(w_2; L)}\right), \qquad (2)$$

where $w_1$ and $w_2$ are weight vectors for two classifiers respectively, and $\mathcal{L}(w; L)$ is the
log-loss over the dataset $L$. To make sure that the two classifiers are trained on different views
of the dataset, for each feature $i$, at least one of the two classifiers must have a zero weight
in the $i$-th dimension. This constraint can be written as

$$\forall i, \; 1 \le i \le d, \quad w_1^i w_2^i = 0. \qquad (3)$$



In each iteration, solving the above optimization problem will automatically find an optimal 
split of the features. 
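
The objective in Equation (2) and the constraint in Equation (3) can be evaluated as in the sketch below. It only computes the objective for given weight vectors and checks the constraint; it does not perform the optimization. The logistic form of the log-loss, the labels in {+1, -1}, the omission of the bias term and all function names are assumptions made for illustration.

```python
import math

def log_loss(w, data):
    """Logistic log-loss of a bias-free linear classifier; labels are +1 or -1."""
    total = 0.0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        total += math.log1p(math.exp(-y * score))
    return total

def pmc_objective(w1, w2, data):
    """log(exp(L(w1)) + exp(L(w2))) from Eq. (2), computed in a numerically stable way."""
    a, b = log_loss(w1, data), log_loss(w2, data)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def is_disjoint_split(w1, w2):
    """Constraint of Eq. (3): no feature has a non-zero weight in both classifiers."""
    return all(u * v == 0 for u, v in zip(w1, w2))

data = [([1.0, 2.0], 1), ([-1.0, -1.5], -1), ([2.0, 0.5], 1), ([-0.5, -2.0], -1)]
w1, w2 = [1.0, 0.0], [0.0, 1.0]   # each classifier uses one of the two features
value = pmc_objective(w1, w2, data)
```

The log-sum-exp form behaves like a smooth maximum of the two losses, so minimizing it pushes both classifiers to perform well on their own feature subset.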



To obtain the feature subsets automatically, Sun et al. (2011) turned to genetic algorithms
(GAs) for help. Each bit in the binary bit strings in GAs is associated with one
feature. If the $i$-th feature is selected, the $i$-th bit is 1, otherwise this bit is 0. Suppose the
size of the population is n, then in each iteration, the best n individuals will be selected 
as the next generation. Each individual in the final genetic population corresponds to a 
candidate feature subset, which can be regarded as one view of the data. 
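
The bit-string encoding and the selection step described above can be sketched as follows. Crossover and mutation are omitted, and the fitness function is a made-up placeholder (the text does not specify one); `decode`, `select_best` and `toy_fitness` are hypothetical names.

```python
def decode(bits):
    """Map a bit string to the indices of the selected features."""
    return [i for i, b in enumerate(bits) if b == 1]

def select_best(population, fitness, n):
    """Keep the n fittest individuals as the next generation."""
    return sorted(population, key=fitness, reverse=True)[:n]

def toy_fitness(bits):
    # Placeholder fitness (an assumption): reward using feature 2, penalize subset size.
    return 10 * bits[2] - sum(bits)

population = [(1, 0, 1, 1), (0, 0, 1, 0), (1, 1, 1, 1), (0, 1, 0, 0)]
survivors = select_best(population, toy_fitness, n=2)
views = [decode(bits) for bits in survivors]
```

Each surviving individual yields one candidate feature subset, which plays the role of one view.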

The literature shows that several kernel functions have been successfully used, such as
the linear kernel, the polynomial kernel, and the Gaussian kernel. Since different kinds of
kernel function correspond to different notions of similarity, it is reasonable to obtain an
optimal combination of these kernel functions instead of selecting one specific kernel
function to describe the data. These different kinds of kernel function can be seen as distinct
views of the data, and the problem of how to learn the kernel combination can therefore be
cast as multiple kernel learning.

The above view construction methods can be analyzed and categorized into three classes. 
The first class includes techniques that construct multiple views from meta data through 
random approaches. The second class consists of algorithms that reshape or decompose the 
original single-view feature into multiple views, such as the above matrix representations or
different kernel functions. The third class is composed of methods that perform feature set
partitioning automatically, such as PMC (Chen et al., 2011). This last type of algorithm
bears some connections with the mature feature selection algorithms (Jain and Zongker,
1997; Guyon and Elisseeff, 2003); however, there are significant differences between multi-view
feature selection and single-view feature selection. In multi-view feature selection,
the relationships between multiple views should additionally be considered, besides the 
information within each view. 

3.2 View Evaluation

Constructing multiple views is just one task of view generation; another significant aspect 
is to evaluate these views and ensure their effectiveness for multi-view learning algorithms. 
Several approaches have been proposed in the multi-view learning literature that analyze the 
relationships between multiple views or cope with the problems resulting from the violation 
of view assumptions or the noise in the views. 

Muslea et al. (2002b) first introduced a view validation algorithm which predicts whether
or not the views are sufficiently compatible for solving multi-view learning tasks. This
algorithm tries to learn a decision tree in a supervised way to discriminate between learning
tasks according to whether or not the views are sufficiently compatible for multi-view learn- 
ing. A set of features is designed to indicate how incompatible the views are, and the 
label of each instance is generated automatically by comparing the accuracy of single- and 
multi-view algorithms on a test set. 

The assumption of view sufficiency does not generally hold in practice. For example, 
in the task of video concept detection, one frame contains an airplane and the other con- 
tains an eagle, but both frames may have the same color histogram feature. Therefore, 
it is difficult for the low-level visual features alone to sufficiently represent the concepts. 

Yan and Naphade (2005) proposed semi-supervised cross-feature learning (SCFL) to
alleviate the problems of co-training when some views are inadequate for learning concepts
by themselves. When the view sufficiency assumption fails, the main concern in applying
co-training is that the additional training data, which carry classification noise, are likely
to corrupt the initial classifiers. To address this problem, after the unlabeled data have been
labeled by the initial classifiers of the two views, a second classifier is trained for each view
based solely on these newly labeled data. With the assistance of validation data V, all four
classifiers are combined with weights that reflect how much benefit can be obtained from the
unlabeled data without hurting the performance of the initial classifiers. If the predictions
from the unlabeled data are too noisy to use, the combined weights of the two classifiers 
newly learned on the unlabeled data can simply be zeroed, and we back off to the initial 
classifiers trained on the labeled data. 

The performance of multi-view learning algorithms may be influenced by noise in
the views. Christoudias et al. (2008) defined a view disagreement problem, in which the
samples from each view do not always belong to the same class but sometimes belong to
an additional background class as a result of noise. To detect and filter view disagreement,
a conditional view entropy $H(x^i \mid x^j)$ was defined as a measure of the uncertainty in view
$x^i$ given the observed view $x^j$. The conditional view entropy is expected to be larger when
conditioning takes place on the background rather than the foreground. By thresholding the
conditional view entropy, the samples whose views display disagreement can be discarded
in each iteration of the co-training algorithm, which improves the performance of the
classifiers.

Yu et al. (2011) proposed a probabilistic approach to co-training, called Bayesian
co-training, which copes with per-view noise. This algorithm employs a latent variable $f_j$
for each view and a consensus latent variable $f_c$ to model the agreement on different
views. The potential $\psi(f_j, f_c)$ denotes the compatibility between the $j$-th view and
the consensus function and can be written as

$$\psi(f_j, f_c) = \exp\left(-\frac{\|f_j - f_c\|^2}{2\sigma_j^2}\right).$$

The parameters $\{\sigma_j\}$ act as reliability indicators and control the strength of interaction
between the $j$-th view and the consensus latent variable. A small value of $\sigma_j$ gives the
view a strong influence on the final output, whereas a large value allows the model to discount
observations from that view. Thus the Bayesian co-training model can handle per-view noise,
where each sample of a given view is assumed to be corrupted by the same amount of noise.
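
The role of the per-view reliability parameter can be checked numerically. The sketch below assumes the Gaussian-shaped potential $\psi(f_j, f_c) = \exp(-\|f_j - f_c\|^2 / (2\sigma_j^2))$ of Bayesian co-training; the function name and the toy numbers are introduced here purely for illustration.

```python
import math

def compatibility(f_j, f_c, sigma_j):
    """psi(f_j, f_c) = exp(-||f_j - f_c||^2 / (2 * sigma_j^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(f_j, f_c))
    return math.exp(-sq_dist / (2.0 * sigma_j ** 2))

f_c = [1.0, -1.0, 1.0]   # consensus values on three samples (toy numbers)
f_j = [0.5, -1.0, 0.0]   # latent values of view j that partly disagree

# Small sigma_j: the view is trusted, so disagreement is penalized strongly.
reliable = compatibility(f_j, f_c, sigma_j=0.5)
# Large sigma_j: the view is treated as noisy, so disagreement is tolerated.
noisy = compatibility(f_j, f_c, sigma_j=5.0)
```

With the same disagreement between view and consensus, the potential is much lower for the small $\sigma_j$, which is why a view with small $\sigma_j$ pulls the consensus towards itself.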



Christoudias et al. (2009a) extended Bayesian co-training to the heteroscedastic case, in
which each observation can be corrupted by a different noise level. Assume that the latent
functions can be corrupted with arbitrary Gaussian noise such that

$$\psi(f_j, f_c) = \mathcal{N}(f_j;\, f_c, A_j),$$

where $A_j$ is the noise covariance matrix. When assuming i.i.d. noise, the noise matrix can
be written as

$$A_j = \mathrm{diag}(\sigma_{1,j}^2, \ldots, \sigma_{N,j}^2),$$

where $\sigma_{i,j}^2$ is the estimate of the noise corrupting sample $i$ in view $j$. Thus the
heteroscedastic Bayesian co-training model can incorporate sample-dependent noise modeled
by the per-view noise covariance matrices $A_j$.




In multiple kernel learning, different kernels may use inputs coming from various repre- 
sentations, possibly from a range of modalities or sources. These representations may have 
contrasting measures of similarity corresponding to different kernels, and can be regarded 
as different views of the data. In this case, combining kernels is one possible way to
combine multiple information sources; however, in the real world, the sources may be corrupted
by different kinds of noise, so when some of the kernels are noisy or irrelevant, it is necessary
to optimize the kernel weights in the learning process. Lewis et al. (2000) compared the
performances of unweighted and weighted sums of kernels on a gene functional classification
task. They considered a case in which additional, noisy kernels are added to the system. As 
more noise is added to the system, the performance of the unweighted average deteriorates, 
but the weighted kernel approach learns to down-weight the noise kernels and hence
continues to work well. Most multiple kernel learning algorithms are global techniques under
the assumption of a per-view kernel weighting, and these methods therefore cannot cope
with the presence of complex noise processes, such as heteroscedastic noise, or missing data.
Christoudias et al. (2009b) presented a Bayesian localized approach for combining different
feature representations with Gaussian processes that learns a local weighting over each view.
Let $X = [X^1, \ldots, X^V]$ be the set of all observations with $V$ views, let
$Y = [y_1, \ldots, y_N]^\top$ be the set of labels, and let $f = [f_1, \ldots, f_N]^\top$ be a set of
latent functions. The Gaussian Process (GP) prior over the latent functions can be written as
$p(f \mid X) = \mathcal{N}(0, K)$. If a Gaussian noise model is used, then
$p(Y \mid f) = \mathcal{N}(f, \sigma^2 I)$ is obtained. The covariance function can be
obtained by combining the covariances of feature representations in a non-linear manner;
thus, classification is performed using the standard GP approach with a common covariance
function.

Liu and Yuen (2011) introduced two new confidence measures, namely, inter-view
confidence and intra-view confidence, to describe the view sufficiency and view dependency
issues in multi-view learning. Considering a sample $X$ associated with $M$ views, the
observed data are represented as $X^1, \ldots, X^M$ respectively; based on the mutual information
definition, the inter-view confidence of $X$ is defined as



M M 



Cinter{X) - Z^Z^j^j^i^j^j 



1 r- 

where $I(X^i, X^j)$ measures the mutual information between $X^i$ and $X^j$. By maximizing the
inter-view confidence, the total data dependency is minimized. In addition, the authors pro- 
posed the calculation and minimization of the total inconsistency of labeled and unlabeled 
data iteratively in a semi-supervised manner. Thus the view sufficiency can be defined as 

_^ 1 

(^intra\X) 






^^F{xt,xf,s,y 

where $X_i^l$ and $X_i^u$ are the labeled and unlabeled datasets respectively, $S_i$ is the
similarity matrix for view $i$, and $F$ measures the data consistency between $X_i^l$ and $X_i^u$.

Correlation between views is an important consideration in subspace-based approaches
for multi-view learning. Hotelling (1936) introduced canonical correlation analysis (CCA)
to describe the linear relation between two views. CCA aims to compute a low-dimensional
shared embedding of both views of variables such that the correlations among the variables
between the two views are maximized in the embedded space. Since the new subspace is
simply a linear system of the original space, CCA can only be used to describe linear relations.
Under a Gaussian assumption, CCA can also be used to test stochastic independence
between two views of variables. Akaho (2006) studied a hybrid approach of CCA with
a kernel machine, called kernel canonical correlation analysis (KCCA), to identify
non-linearly correlated projections between the two views. Formally, for two views
$X \in \mathbb{R}^{n \times N}$ and $Y \in \mathbb{R}^{m \times N}$, CCA computes two projection
vectors, $w_x \in \mathbb{R}^{n}$ and $w_y \in \mathbb{R}^{m}$, such that the
following correlation coefficient:



$$\rho = \frac{w_x^\top X Y^\top w_y}{\sqrt{(w_x^\top X X^\top w_x)(w_y^\top Y Y^\top w_y)}}$$

is maximized. Similarly in KCCA, we express the projection directions as $w_x = X\alpha$ and
$w_y = Y\beta$, where $\alpha$ and $\beta$ are vectors of size $N$. Irrespective of whether CCA or KCCA
is used, a sequence of correlation coefficients $\{\rho_1, \rho_2, \ldots\}$ arranged in descending order can
be obtained. Several measures of association in the literature are constructed as functions
of the correlation coefficients, of which the two most common association measures are as
follows. One is the maximal correlation

$$r(X, Y) = \rho_1,$$

and the other is

$$r(X, Y) = -\frac{1}{2}\log(1 - \rho_1^2).$$
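
For linear CCA the correlation coefficients can be computed directly from the two data matrices. The following is a minimal NumPy sketch under stated assumptions: the columns of `X` and `Y` are paired samples, the small ridge term `reg` is added here only for numerical stability, and the toy data are synthetic. It whitens each view with a Cholesky factor and reads the canonical correlations off the singular values of the whitened cross-covariance.

```python
import numpy as np

def canonical_correlations(X, Y, reg=1e-8):
    """Canonical correlations of X (n x N) and Y (m x N), in descending order."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    Cxx = X @ X.T + reg * np.eye(X.shape[0])
    Cyy = Y @ Y.T + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T
    Lx = np.linalg.cholesky(Cxx)            # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)            # Cyy = Ly Ly^T
    # Whitened cross-covariance: Lx^{-1} Cxy Ly^{-T}
    T = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    rho = np.linalg.svd(T, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)

rng = np.random.default_rng(0)
N = 500
z = rng.standard_normal(N)                  # shared latent signal (toy data)
X = np.vstack([z + 0.5 * rng.standard_normal(N), rng.standard_normal(N)])
Y = np.vstack([rng.standard_normal(N), z + 0.5 * rng.standard_normal(N)])

rho = canonical_correlations(X, Y)
max_correlation = rho[0]                                # first association measure
log_measure = -0.5 * np.log(1.0 - rho[0] ** 2)          # log-based measure
```

In this toy setting only one direction is shared between the views, so the first coefficient is large and the second is close to zero.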

4. View Combination 

One traditional way to combine multiple views is to concatenate all views into
a single view to adapt to the single-view learning setting. However, this concatenation
can cause over-fitting on a small training sample and is not physically meaningful because each
view has a specific statistical property. Thus we resort to advanced methods of combining
multiple views to improve learning performance compared to single-view
learning algorithms.

Co-training style algorithms usually train separate but correlated learners on each view, 
and the outputs of learners are forced to be similar on the same validation points, as shown
in Figure 2. Under the consensus principle, the goal of each iteration is to maximize the
consistency of two learners on the validation set. Certainly there may be some disagreement 
between the predictions from the two learners on the validation set; however, this disagree- 
ment is propagated back to the training set to help to train more accurate learners, thus 
minimizing the disagreement on the validation set in the next iteration. 

Co-training is a classical algorithm in semi-supervised learning. In co-training, one
classifier is trained per view, using only the features from that view. By maximizing the
agreement on the predictions of two classifiers on the labeled dataset, as well as minimizing
the disagreement on the predictions of two classifiers on the unlabeled dataset, the classifiers
learn from each other and reach an optimal solution. Here, the unlabeled set is considered
to be the validation set. In each iteration, the learner on one view labels unlabeled data
which are then added to the training pool of the other learner; therefore, the information
underlying two views can be exchanged in this scheme. Co-regularization can be regarded as
a regularized version of the co-training algorithm. Unlike co-training, the co-regularization
algorithm formally measures the agreement on two distinct views using Eq. (1). By solving
the corresponding objective problem, two optimal classifiers can be obtained.

Figure 2: The process of co-training style algorithms.

If a validation set is not provided, for example in an unsupervised learning setting, it is
necessary to train the classifier on each view as well as validate the combination of views on
the same training set. Kumar and Daumé III (2011) applied the idea of co-training to the
unsupervised learning setting and proposed a spectral clustering algorithm for multi-view
data. Under the assumption that the true underlying clustering would assign corresponding
points in each view to the same cluster, this algorithm solves spectral clustering on
individual graphs to obtain the discriminative eigenvectors $U_1$ ($U_2$) in each view, then clusters
points using $U_1$ ($U_2$) and uses this clustering to modify the graph structure in view 2 (view 1).
This process is repeated for a number of iterations. Similar to many other
multi-view clustering algorithms (Kumar et al., 2010, 2011), multiple views in this setting
are usually combined on the training set considering the consensus principle. In multi-view
supervised learning problems, an implicit validation set is also employed to combine multiple
views. For example, in the Bayesian co-training proposed by Yu et al. (2011), a Bayesian
undirected graphical model for co-training through Gaussian processes is constructed. A latent
function $f_c$ is introduced to ensure the conditional independence between the output $y$ of
each example and the latent functions $f_j$ for each view. Thus $\{f_c\}$ can be seen as an implicit
validation set which connects multiple views in a latent space.

Figure 3: Sketch map of multiple kernel learning.

In multiple kernel learning, instead of choosing a single kernel function, it is better to
use a set and allow an algorithm to choose suitable kernels and the kernel combination.
Since different kernels may correspond to various notions of similarity or inputs coming
from different representations, possibly from a number of sources or modalities, combining
kernels is one possible way to integrate multiple information sources and find a better
solution, as shown in Figure 3. There are several ways in which the combination can be
made, and each has its own combination parameter characteristics. These methods can be
grouped into two categories:

1. Linear combination methods

There are several linear ways to combine multiple kernels. These methods have two
basic categories:

Direct summation kernel

$$K(x_i, x_j) = \sum_{k=1}^{M} K_k(x_i, x_j). \qquad (4)$$

Weighted summation kernel

$$K(x_i, x_j) = \sum_{k=1}^{M} d_k K_k(x_i, x_j). \qquad (5)$$



Using an unweighted sum gives equal preference to all kernels, which may not be ideal;
a weighted sum may be a better choice. In the latter case, versions of this approach
differ in the way they place restrictions on the kernel weights $\{d_k\}_{k=1}^{M}$. Lanckriet et al.
(2002, 2004) used a direct approach to optimize the unrestricted kernel combination
weights. The combined kernel matrix is selected from the following set:

$$\mathcal{K} = \left\{K : K = \sum_{k=1}^{M} d_k K_k,\ K \succeq 0,\ \mathrm{tr}(K) \le c\right\}.$$

Lanckriet et al. (2004) restricted the combination weights to non-negative values by
selecting the combined kernel matrix from the set:

$$\mathcal{K} = \left\{K : K = \sum_{k=1}^{M} d_k K_k,\ d_k \ge 0,\ K \succeq 0,\ \mathrm{tr}(K) \le c\right\}.$$
Joachims and Shawe-Taylor (2001) followed the constraints $d_k \ge 0$, $\sum_{k=1}^{M} d_k = 1$,
and considered the convex combination of kernels. If only binary $d_k$ for kernel
selection is allowed, the kernels whose $d_k = 0$ can be discarded and only the kernels
whose $d_k = 1$ are used. Xu et al. (2009b) used this definition to perform feature
selection. Usually the same weight is assigned to a kernel over the whole input space,
which ignores the data distribution of each local region. Gönen and Alpaydın (2008)
proposed to assign different weights to kernel functions according to data distribution,
and defined the locally combined kernel matrix as

$$K(x_i, x_j) = \sum_{k=1}^{M} d_k(x_i) K_k(x_i, x_j) d_k(x_j), \qquad (6)$$

where $d_k(x)$ is the gating function which chooses the feature space as a function of the
input $x$.

2. Nonlinear combination methods

Linear combinations of base kernels are limited, and far richer representations can
be achieved by combining kernels in other fashions. Varma and Babu (2009) tried to
use the products of base kernels and other combinations which yield positive definite
kernels to perform multiple kernel learning; for example, the exponentiation and power
ways of combining kernels:

$$K(x_i, x_j) = \exp\left(-\sum_{k=1}^{M} d_k\, x_i^\top A_k x_j\right),$$

or

$$K(x_i, x_j) = \left(d_0 + \sum_{k=1}^{M} d_k\, x_i^\top A_k x_j\right)^{n}.$$



Another work by Cortes et al. (2009) is a non-linear kernel combination method based
on kernel regression and the polynomial combination of kernels. They proposed to
combine kernels as follows:

$$K(x_i, x_j) = \sum_{0 \le k_1 + \cdots + k_M \le d,\ k_m \ge 0} \mu_{k_1 \cdots k_M} \prod_{m=1}^{M} K_m(x_i, x_j)^{k_m}, \qquad \mu_{k_1 \cdots k_M} \ge 0.$$

A special case is considered:

$$K(x_i, x_j) = \sum_{k_1 + \cdots + k_M = d,\ k_m \ge 0}\ \prod_{m=1}^{M} \mu_m^{k_m} K_m(x_i, x_j)^{k_m}, \qquad \mu_m \ge 0.$$

Consequently, the objective of the algorithm is to find the vector $\mu = (\mu_1, \ldots, \mu_M)^\top$.
However, the empirical results do not show consistent performance improvement, 
bringing into question whether the non-linear combination of kernel functions is nec- 
essary or efficient. 
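
The two families of combinations can be illustrated with a few lines of code. The sketch below is only an illustration under stated assumptions: the three base kernels, their parameters (`degree`, `gamma`) and the helper names are chosen here for the example. The weighted sum corresponds to the weighted summation kernel (all weights equal to one gives the direct summation kernel), and the product of base kernels is one simple non-linear combination that remains positive semi-definite.

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2):
    return (1.0 + linear_kernel(x, y)) ** degree

def gaussian_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

BASE_KERNELS = [linear_kernel, polynomial_kernel, gaussian_kernel]

def weighted_sum_kernel(x, y, weights, kernels=BASE_KERNELS):
    """K(x, y) = sum_k d_k * K_k(x, y); all d_k = 1 gives the direct sum."""
    return sum(d * k(x, y) for d, k in zip(weights, kernels))

def product_kernel(x, y, kernels=BASE_KERNELS):
    """A simple non-linear combination: the product of the base kernels."""
    result = 1.0
    for k in kernels:
        result *= k(x, y)
    return result

def gram_matrix(points, kernel):
    return [[kernel(p, q) for q in points] for p in points]

points = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
convex_weights = [0.2, 0.3, 0.5]     # d_k >= 0 and sum_k d_k = 1 (toy values)
K_linear = gram_matrix(points, lambda p, q: weighted_sum_kernel(p, q, convex_weights))
K_product = gram_matrix(points, product_kernel)
```

In practice the weights would be learned together with the classifier; here they are fixed by hand so that only the combination step is shown.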



Subspace learning-based approaches aim to obtain a latent subspace shared by multiple
views by assuming that the input views are generated from this latent subspace, as
illustrated in Figure 4. In the literature on single-view learning, principal component analysis
(PCA) is the time-honoured and simplest technique to exploit the subspace for
single-view data. Canonical correlation analysis (CCA) can be viewed as the multi-view version of
PCA, and it has become a general tool for performing subspace learning for multi-view data.
Through maximizing the correlation between the two views in the subspace, CCA outputs
one optimal projection on each view; however, since the subspace constructed by CCA is
linear, it cannot be applied directly to many real-world datasets exhibiting
non-linearities. The kernel variant of CCA, namely KCCA, addresses this by
first mapping each data point to a higher-dimensional space in which linear CCA operates. Both
CCA and KCCA exploit the subspace in an unsupervised way, so that the label information
is ignored. Motivated by the generalization from PCA to CCA, multi-view Fisher
discriminant analysis was developed to find informative projections with label information.
Lawrence (2004) cast the Gaussian process as a tool to construct a latent variable model which
could accomplish the task of non-linear dimensionality reduction. Chen et al. (2010) developed a
statistical framework that learns a predictive subspace shared by multiple views based on a
generic multi-view latent space Markov network. Quadrianto and Lampert (2011) studied
the metric learning problem in cross-media retrieval tasks. The goal of metric learning for
multi-view data is to learn metrics with which the original high-dimensional multi-view
features can be projected into a shared feature space, so that the Euclidean distance in this
space is meaningful not only within a single view, but also among different views. Since
the subspace constructed through different methods usually has lower dimensionality than
that of any input view, the "curse of dimensionality" problem is mitigated, and
given the subspaces, it is straightforward to conduct subsequent tasks such as classification
and clustering.

After analyzing the various approaches above to combine multiple views, we sum up
their similarities and differences as follows. (a) Co-training style algorithms usually train
separate learners on distinct views, which are then forced to be consistent across views. Thus
this kind of approach can be regarded as a late combination of multiple views because the
views are considered independently while training base learners. (b) Multiple kernel learning
algorithms calculate separate kernels on each view which are combined with a kernel-based 
method. This kind of approach can be thought of as an intermediate combination of multiple 
views because kernels (views) are combined just before or during the training of the learner. 
(c) Subspace learning-based approaches aim to obtain an appropriate subspace by assuming 
that input views are generated from a latent view. This kind of approach can be seen as the 
prior combination of multiple views because multiple views are straightforwardly considered 
together to exploit the shared subspace. 
Figure 4: Sketch map of subspace learning for multi-view data.



5. Co-training Style Algorithms 

Co-training (Blum and Mitchell, 1998) was one of the earliest schemes for multi-view
learning. Since then, many variants have been developed. Besides the research on designing
various algorithms, there are also a number of works on assumptions for co-training, which 
ensure the success of algorithms. 

5.1 Assumptions for Co-training 

Co-training considers a setting in which each example can be partitioned into two distinct 
views, and makes three main assumptions: (a) Sufficiency: each view is sufficient for clas- 
sification on its own, (b) Compatibility: the target functions in both views predict the same 
labels for co-occurring features with high probability, and (c) Conditional independence: 
the views are conditionally independent given the class label. The conditional indepen- 
dence assumption plays a critical role, but it is usually too strong to be satisfied in practice 
and several weaker alternatives have thus been considered. 

5.1.1 Conditional Independence Assumption 



Blum and Mitchell (1998) proved that when two sufficient views are conditionally
independent given the class label, co-training can be successful. They gave a theorem that if the
concept classes $C_{1,2}$ on views $X_{1,2}$ are learnable in the PAC model in spite of
classification noise, and if the conditional independence assumption is satisfied, then $(C_1, C_2)$ is
learnable in the co-training model from unlabeled data only, given an initial weakly-useful
predictor $h(x_1)$. Specifically, let classification noise $(\alpha, \beta)$ be a setting in which true positive
examples are incorrectly labeled (independently) with probability $\alpha$, and true negative
examples are incorrectly labeled (independently) with probability $\beta$. Further define $f(x)$ as
the target concept and $p = \Pr_D(f(x) = 1)$ as the probability that a random example from
$D$ is positive. When the predictions of $h(x_1)$ are treated as noisy labels, the sum of the two
noise rates satisfies

$$\alpha + \beta \le 1 - \frac{\epsilon^2}{p(1-p)}, \qquad (7)$$

which is at most $1 - 4\epsilon^2$, where $\epsilon$ is the weak-usefulness parameter of $h(x_1)$.
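
The bound in Eq. (7) follows from a short calculation, sketched here under the weak-usefulness definition of Blum and Mitchell (1998). The shorthand $c$ and $q$ is introduced only for this derivation and does not appear in the original text.

```latex
% Shorthand (introduced for this sketch):
%   c = \Pr_D(h(x_1) = 1), \qquad q = \Pr_D(f(x) = 1 \mid h(x_1) = 1).
% Weak usefulness with parameter \epsilon means c \ge \epsilon and q \ge p + \epsilon.
\begin{align*}
\alpha &= \Pr(h = 0 \mid f = 1) = 1 - \frac{q\,c}{p},
&
\beta &= \Pr(h = 1 \mid f = 0) = \frac{(1 - q)\,c}{1 - p}, \\
\alpha + \beta &= 1 - \frac{c\,(q - p)}{p(1 - p)}
\;\le\; 1 - \frac{\epsilon^2}{p(1 - p)}
\;\le\; 1 - 4\epsilon^2 ,
\end{align*}
% where the last step uses p(1 - p) \le 1/4.
```

The total noise rate therefore stays strictly below one, which is the condition under which learning with classification noise remains possible.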

The above inequality gives some degree of justification for the co-training restriction
on rule-based bootstrapping. However, it does not provide a bound on generalization error
as a function of empirically measurable quantities; hence, based on the same conditional
independence assumption, Dasgupta et al. (2002) gave PAC-style bounds for co-training.



Let $S$ be an i.i.d. sample consisting of individual samples $s_1, \ldots, s_m$. A partial rule $h$ on
a dataset $X$ is a mapping from $X$ to the label set $\{1, \ldots, k, \perp\}$, where $k$ is the number of
class labels and $\perp$ denotes that the partial rule $h$ gives no opinion. Then with probability at
least $1 - \delta$ over the choice of $S$ we have the following for all pairs of rules $h_1$ and $h_2$: if
$\gamma_i(h_1, h_2, \delta/2) > 0$ for $1 \le i \le k$, then $f$ is a permutation and for all $1 \le i \le k$,

$$P(h_1 \ne i \mid f(y) = i,\ h_1 \ne \perp) \le \frac{1}{\gamma_i(h_1, h_2, \delta)} \left(\epsilon_i(h_1, h_2, \delta) + \hat{P}(h_1 \ne i \mid h_2 = i,\ h_1 \ne \perp)\right),$$

where $\hat{P}$ denotes the empirical probability over the sample $S$,

$$\epsilon_i(h_1, h_2, \delta) = \sqrt{\frac{\ln 2\,(|h_1| + |h_2|) + \ln(2/\delta)}{2\,|S(h_2 = i,\ h_1 \ne \perp)|}},$$

$$\gamma_i(h_1, h_2, \delta) = \hat{P}(h_1 = i \mid h_2 = i,\ h_1 \ne \perp) - \hat{P}(h_1 \ne i \mid h_2 = i,\ h_1 \ne \perp) - 2\epsilon_i(h_1, h_2, \delta). \qquad (8)$$

Note that if the sample size is sufficiently large (relative to $|h_1|$ and $|h_2|$) then $\epsilon_i(h_1, h_2, \delta)$
is near zero. Also note that if $h_1$ and $h_2$ have near perfect agreement when neither is $\perp$
then $\gamma_i(h_1, h_2, \delta)$ is near one. The disagreement between $h_1$ and $h_2$ thus upper bounds the
error of $h_1$. The co-training algorithm therefore needs to maximize the agreement on unlabeled data
between classifiers based on different views under the conditional independence assumption
to improve the accuracy of each hypothesis.
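
As a rough numerical illustration of how such a bound behaves, the function below evaluates $\epsilon_i$, $\gamma_i$ and the right-hand side of the bound from simple counts on unlabeled data. It is a sketch under assumptions: the argument names are introduced here, `bits_h1` and `bits_h2` stand for the rule sizes $|h_1|$ and $|h_2|$, and a single confidence parameter `delta` is passed in for all three quantities.

```python
import math

def cotraining_error_bound(n_agree, n_disagree, bits_h1, bits_h2, delta):
    """Return (epsilon_i, gamma_i, bound) for one class i.

    n_agree / n_disagree count the unlabeled samples with h2 = i on which h1
    gives an opinion and predicts i / predicts another class.
    """
    n = n_agree + n_disagree
    eps = math.sqrt((math.log(2) * (bits_h1 + bits_h2) + math.log(2 / delta)) / (2 * n))
    p_agree = n_agree / n
    p_disagree = n_disagree / n
    gamma = p_agree - p_disagree - 2 * eps
    # The bound is only defined when gamma is positive.
    bound = (eps + p_disagree) / gamma if gamma > 0 else None
    return eps, gamma, bound

# Same 95% agreement rate, but 100 versus 10000 relevant unlabeled samples.
eps_small, gamma_small, bound_small = cotraining_error_bound(95, 5, 10, 10, 0.05)
eps_large, gamma_large, bound_large = cotraining_error_bound(9500, 500, 10, 10, 0.05)
```

With only 100 samples the bound is vacuous (larger than one), whereas with 10000 samples the same agreement rate certifies a small error, which shows why large amounts of unlabeled data matter.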

5.1.2 Weak Dependence Assumption 

The above-mentioned conditional independence assumption is too strong to be satisfied
for the two views in real applications. Abney (2002) relaxed the assumption and found
that weak dependence alone can lead to successful co-training. Given the mapping function
value $Y = y$, the conditional dependence of opposing-view rules $h_1$ and $h_2$ is defined as

$$d_y = \frac{1}{2} \sum_{u,v} \left|\Pr[h_1 = v \mid Y = y,\ h_2 = u] - \Pr[h_1 = v \mid Y = y]\right|.$$

If $h_1$ and $h_2$ are conditionally independent, then $d_y = 0$. The rules $h_1$ and $h_2$ satisfy weak rule
dependence just in case

$$d_y \le p_2\,\frac{q_1 - p_1}{2 p_1 q_1},$$

where $p_1 = \min_u \Pr[h_2 = u \mid Y = y]$, $p_2 = \min_u \Pr[h_1 = u \mid Y = y]$, and $q_1 = 1 - p_1$.
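
The dependence measure can be computed directly from the joint distribution of the two rules for a fixed label. The sketch below assumes the definition $d_y = \frac{1}{2}\sum_{u,v} |\Pr[h_1 = v \mid Y = y, h_2 = u] - \Pr[h_1 = v \mid Y = y]|$ and the threshold $p_2 (q_1 - p_1) / (2 p_1 q_1)$; the table layout `joint[u][v]` and the function names are assumptions made for this example.

```python
def marginals(joint):
    """joint[u][v] = Pr[h2 = u, h1 = v | Y = y] for one fixed label y."""
    p_h2 = [sum(row) for row in joint]
    p_h1 = [sum(row[v] for row in joint) for v in range(len(joint[0]))]
    return p_h1, p_h2

def conditional_dependence(joint):
    """d_y: half the summed gap between Pr[h1 | y, h2] and Pr[h1 | y]."""
    p_h1, p_h2 = marginals(joint)
    total = 0.0
    for u, row in enumerate(joint):
        for v, p_uv in enumerate(row):
            total += abs(p_uv / p_h2[u] - p_h1[v])
    return 0.5 * total

def satisfies_weak_dependence(joint):
    p_h1, p_h2 = marginals(joint)
    p1, p2 = min(p_h2), min(p_h1)
    q1 = 1.0 - p1
    return conditional_dependence(joint) <= p2 * (q1 - p1) / (2.0 * p1 * q1)

# Conditionally independent rules: the joint is the product of its marginals.
independent = [[0.3 * 0.4, 0.3 * 0.6],
               [0.7 * 0.4, 0.7 * 0.6]]
# Fully dependent rules: h1 always copies h2.
copied = [[0.3, 0.0],
          [0.0, 0.7]]
d_independent = conditional_dependence(independent)
d_copied = conditional_dependence(copied)
```

Independent rules give $d_y = 0$ and pass the test, while rules that always agree given the label carry no new information for each other and fail it.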
5.1.3 Expansion Assumption

Balcan et al. (2004) proposed a much weaker "expansion" assumption on the underlying
data distribution and proved that it was sufficient for iterative co-training to succeed given
appropriately strong PAC-learning algorithms on each feature set. Assume that examples
are drawn from a distribution $D$ over an instance space $X$. Let $X^+$ and $X^-$ denote the
positive and negative regions of $X$ respectively. For $S_1 \subseteq X_1$ and $S_2 \subseteq X_2$, let $\mathbf{S}_i$ ($i = 1, 2$)
denote the event that an example $(x_1, x_2)$ has $x_i \in S_i$. If we think of $S_1$ and $S_2$ as confident
sets in each view, then $\Pr(\mathbf{S}_1 \wedge \mathbf{S}_2)$ denotes the probability mass on examples for which we
are confident about both views, and $\Pr(\mathbf{S}_1 \oplus \mathbf{S}_2)$ denotes the probability mass on examples
for which we are confident about just one view. $D^+$ is said to be expanding if for any
$S_1 \subseteq X_1^+$ and $S_2 \subseteq X_2^+$,

$$\Pr(\mathbf{S}_1 \oplus \mathbf{S}_2) \ge \epsilon \min\left[\Pr(\mathbf{S}_1 \wedge \mathbf{S}_2),\ \Pr(\overline{\mathbf{S}}_1 \wedge \overline{\mathbf{S}}_2)\right]. \qquad (9)$$

Another slightly stronger kind of expansion called "left-right expansion" can be defined as
follows. $D^+$ is right-expanding if for any $S_1 \subseteq X_1^+$ and $S_2 \subseteq X_2^+$,

$$\text{if } \Pr(\mathbf{S}_1) \le 1/2 \text{ and } \Pr(\mathbf{S}_2 \mid \mathbf{S}_1) \ge 1 - \epsilon, \text{ then } \Pr(\mathbf{S}_2) \ge (1 + \epsilon)\Pr(\mathbf{S}_1),$$

where, as before, $\mathbf{S}_i$ denotes the event that $x_i \in S_i$. $D^+$ is left-expanding if the above
holds with indices 1 and 2 reversed.

To see why this is useful for co-training, suppose that $S_i$ is the confident set in $X_i^+$ and
that this set is small ($\Pr(\mathbf{S}_i) \le 1/2$), and that a classifier, which learns from positive data
on the conditional distribution that $S_i$ induces over $X_{3-i}$ ($i = 1, 2$), is trained until it has
error $\le \epsilon$ on that distribution. The definition then implies that the confident set on $X_{3-i}$
will have noticeably larger probability than $S_i$, so the confident region grows in each round.

5.1.4 Large Diversity Assumption 



Goldman and Zhou (2000) used two different supervised learning algorithms, and Zhou and Li
(2005b) used two different parameter configurations of the same base learner for co-training
style algorithms without redundant views, but neither of them addressed the reasons
for their success. Afterwards, Wang and Zhou (2007) showed that when the diversity
between the two learners is larger than their errors, the performance of the learner can be
improved by co-training style algorithms. The difference d{hi,hj) between the two classi- 
fiers hi and hj implies the different biases between them, and the two classifiers will label 
some instances with different labels. If the examples labeled by the classifier hi are to be 
useful for the classifier hj, hi should know some information that hj does not know. In other 
words, hi and hj should have significant differences. As the co-training process proceeds, 
the two classifiers will become increasingly similar and the difference between them will be- 
come smaller as the two classifiers label more and more unlabeled instances for each other. 
The co-training process would therefore not improve performance further after a number of 
learning rounds. 
5.2 Co-training

Co-training was originally proposed for the problem of semi-supervised learning, in which 
there is access to labeled as well as unlabeled data. It considers a setting in which each 
example can be partitioned into two distinct views, and makes three main assumptions for 
its success: sufficiency, compatibility, and conditional independence.

In the original co-training algorithm (Blum and Mitchell, 1998), given a set $L$ of labeled
examples and a set $U$ of unlabeled examples, the algorithm first creates a smaller pool $U'$
containing $u$ unlabeled examples. It then iterates the following procedure. First, use $L$
to train two naive Bayes classifiers $h_1$ and $h_2$ on the views $x_1$ and $x_2$ respectively. Second,
allow each of these two classifiers to examine the unlabeled pool $U'$ and add the $p$ examples
it most confidently labels as positive, and the $n$ examples it most confidently labels as negative,
to $L$, along with the labels assigned by the corresponding classifier. Finally, the pool $U'$ is
replenished by drawing $2p + 2n$ examples from $U$ at random.
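
The iteration described above can be sketched in plain Python. This is an illustration under assumptions rather than the authors' implementation: the tiny Bernoulli naive Bayes class, the synthetic two-view data produced by `make_example`, and the default values of `pool_size` and `rounds` are all introduced here.

```python
import math
import random

class BernoulliNB:
    """Tiny naive Bayes for binary features and labels in {0, 1}."""
    def fit(self, X, y):
        n_feat = len(X[0])
        self.log_prior, self.theta = {}, {}
        for c in (0, 1):
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log((len(rows) + 1) / (len(X) + 2))
            # Laplace-smoothed probability that each feature equals 1.
            self.theta[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                             for j in range(n_feat)]
        return self

    def prob_positive(self, x):
        score = {c: self.log_prior[c] + sum(math.log(t if v else 1.0 - t)
                                            for v, t in zip(x, self.theta[c]))
                 for c in (0, 1)}
        m = max(score.values())
        e0, e1 = math.exp(score[0] - m), math.exp(score[1] - m)
        return e1 / (e0 + e1)

def co_train(L, U, p=1, n=3, pool_size=20, rounds=10, seed=0):
    """L: list of ((x1, x2), y); U: list of (x1, x2). Returns (h1, h2, L)."""
    rng = random.Random(seed)
    L, U = list(L), list(U)
    rng.shuffle(U)
    pool, U = U[:pool_size], U[pool_size:]            # the smaller pool U'
    for _ in range(rounds):
        y = [label for _, label in L]
        h1 = BernoulliNB().fit([x[0] for x, _ in L], y)
        h2 = BernoulliNB().fit([x[1] for x, _ in L], y)
        for h, view in ((h1, 0), (h2, 1)):
            ranked = sorted(pool, key=lambda x: h.prob_positive(x[view]))
            # n most confident negatives and p most confident positives.
            chosen = [(x, 0) for x in ranked[:n]] + [(x, 1) for x in ranked[-p:]]
            for x, label in chosen:
                if x in pool:
                    pool.remove(x)
                    L.append((x, label))
        # Replenish U' with 2p + 2n examples drawn from the shuffled U.
        refill, U = U[:2 * p + 2 * n], U[2 * p + 2 * n:]
        pool.extend(refill)
        if not pool:
            break
    return h1, h2, L

def make_example(rng, y, flip=0.1, n_features=4):
    """Toy data: each binary feature copies the label, flipped with prob. flip."""
    def noisy_view():
        return tuple(y if rng.random() > flip else 1 - y for _ in range(n_features))
    return (noisy_view(), noisy_view())

data_rng = random.Random(1)
L0 = [(make_example(data_rng, y), y) for y in (0, 1, 0, 1, 0, 1)]
U0 = [make_example(data_rng, data_rng.randint(0, 1)) for _ in range(200)]
h1, h2, L_final = co_train(L0, U0)
```

Each round moves up to $2p + 2n$ self-labeled examples into $L$, so the labeled set grows while both classifiers are retrained on the enlarged set.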

5.3 Co-EM 

The intuition behind the co-training algorithm is that classifier $h_1$ adds examples to the
labeled set that classifier $h_2$ will then be able to use for learning. If the conditional
independence assumption holds, then on average each added example will be as informative as
a random example and learning should progress, unless many of the added examples
belong to the wrong class. If the independence assumption is violated, then on average the
added examples will be less informative and co-training may not be successful. Instead of
committing to labels for the unlabeled examples, co-EM runs EM in each view and
gives unlabeled examples probabilistic labels that may change from one iteration to another.
This is the principal idea of co-EM (Nigam and Ghani, 2000).



Co-EM outperforms co-training for many problems, but it requires the algorithm to
process probabilistically labeled training data and the classifier to output class
probabilities. Hence, the co-EM algorithm had only been studied with naive Bayes as the underlying
learner, even though the Support Vector Machine (SVM) is known to better fit the
characteristics of many classification problems. By reformulating the SVM in a probabilistic way
and estimating the labels of unlabeled data with probabilities, Brefeld and Scheffer (2004)
successfully developed a co-EM version of SVM to close this gap.

5.4 Co-regularization 

Suppose we have two hypothesis spaces, $\mathcal{H}^1$ and $\mathcal{H}^2$, each of which contains a predictor that
approximates the target function well. In the case of co-training, these two are defined over
different representations, or "views", of the data, and trained alternately to maximize mutual
agreement on unlabeled examples. More recently, several papers have formulated those
intuitions as joint complexity regularization, or co-regularization (Sindhwani et al., 2005;
Sindhwani and Rosenberg, 2008), between $\mathcal{H}^1$ and $\mathcal{H}^2$, which are taken to be Reproducing
Kernel Hilbert Spaces (RKHSs) of functions defined on the input space $X$. Given a few
labeled examples $\{(x_i, y_i)\}_{i \in L}$ and a collection of unlabeled data $\{x_i\}_{i \in U}$, co-regularization
learns a prediction function,

$$f_*(x) = \frac{1}{2}\left(f_*^1(x) + f_*^2(x)\right), \tag{10}$$

where $f_*^1 \in \mathcal{H}^1$ and $f_*^2 \in \mathcal{H}^2$ are obtained by solving the following optimization problem,

$$(f_*^1, f_*^2) = \arg\min_{f^1 \in \mathcal{H}^1,\, f^2 \in \mathcal{H}^2} \; \gamma_1 \|f^1\|_{\mathcal{H}^1}^2 + \gamma_2 \|f^2\|_{\mathcal{H}^2}^2 + \mu \sum_{i \in U} \left[f^1(x_i) - f^2(x_i)\right]^2 + \sum_{i \in L} V(y_i, f(x_i)).$$

In this objective function, the first two terms measure complexity by the RKHS norms $\|\cdot\|_{\mathcal{H}^1}^2$
and $\|\cdot\|_{\mathcal{H}^2}^2$ in $\mathcal{H}^1$ and $\mathcal{H}^2$ respectively, the third term enforces agreement among predictors
on unlabeled examples, and the final term evaluates the empirical loss of the mean function
$f = (f^1 + f^2)/2$ on the labeled data with respect to a loss function $V(\cdot, \cdot)$. The real valued
parameters $\gamma_1$, $\gamma_2$ and $\mu$ allow different tradeoffs between the regularization terms. $L$ and
$U$ are index sets over labeled and unlabeled examples respectively.
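
As a rough illustration of this objective, the sketch below solves it in closed form for one special case: linear predictors $f^v(x) = w_v^T x^v$ in each view and the squared loss for $V$. Both choices are assumptions made for the example, as are the function names `co_regularized_ls` and `co_reg_objective` and the default parameter values; the cited methods work with general RKHSs.

```python
import numpy as np

def co_regularized_ls(X1_lab, X2_lab, y, X1_unl, X2_unl,
                      gamma1=1.0, gamma2=1.0, mu=1.0):
    """Co-regularized least squares with linear predictors (illustrative sketch)."""
    d1, d2 = X1_lab.shape[1], X2_lab.shape[1]
    # Stack w = [w1; w2]. A @ w is the mean prediction (f1 + f2) / 2 on labeled data.
    A = 0.5 * np.hstack([X1_lab, X2_lab])
    # B @ w is the disagreement f1 - f2 on unlabeled data.
    B = np.hstack([X1_unl, -X2_unl])
    # Block-diagonal ridge penalty: gamma1 * ||w1||^2 + gamma2 * ||w2||^2.
    G = np.diag(np.concatenate([np.full(d1, gamma1), np.full(d2, gamma2)]))
    # The objective is a strictly convex quadratic in w; set its gradient to zero.
    w = np.linalg.solve(G + mu * B.T @ B + A.T @ A, A.T @ y)
    return w[:d1], w[d1:]

def co_reg_objective(w1, w2, X1_lab, X2_lab, y, X1_unl, X2_unl,
                     gamma1=1.0, gamma2=1.0, mu=1.0):
    """Value of the co-regularization objective for squared loss."""
    f_lab = 0.5 * (X1_lab @ w1 + X2_lab @ w2)
    disagreement = X1_unl @ w1 - X2_unl @ w2
    return (gamma1 * w1 @ w1 + gamma2 * w2 @ w2
            + mu * disagreement @ disagreement
            + np.sum((y - f_lab) ** 2))
```

Increasing `mu` pulls the two view-specific predictors toward agreement on the unlabeled points, while `gamma1` and `gamma2` control the complexity of each predictor separately.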

5.5 Co-regression 

Most studies on multi-view and semi-supervised learning focus on classification problems,
but regression problems can also be solved in a similar way. For instance, Zhou and Li
(2005a) developed a co-training style semi-supervised regression algorithm called CoREG.
This algorithm employs two k-nearest neighbor (kNN) regressors, each of which labels
the unlabeled data for the other during the learning process. To choose
the appropriate unlabeled examples to label, CoREG estimates the labeling confidence by
consulting the influence of the labeling of unlabeled examples on the labeled examples. The
final prediction is made by averaging the regression estimates generated by both regressors.
Inspired by the co-regularization algorithm, Brefeld et al. (2006) proposed a co-regression
algorithm. Formally, given $M$ views, the training instances $\{X_v\}_{v=1}^{M}$ with labels $y(x) \in \mathbb{R}$,
and a finite set of instances $Z \subseteq X$ for which the labels are unknown, we attempt to find
$f_1 : X \to \mathbb{R}, \cdots, f_M : X \to \mathbb{R}$ that minimize

$$Q(f) = \sum_{v=1}^{M} \Big[ \sum_{x \in X_v} V(y(x), f_v(x)) + \nu \|f_v(\cdot)\|^2 \Big] + \lambda \sum_{u,v=1}^{M} \sum_{z \in Z} V(f_u(z), f_v(z)),$$

where the norms measuring complexity are in the respective Hilbert spaces, $\nu$ and $\lambda$ are
trade-off parameters, $V(y(x), f_v(x))$ evaluates the losses between the predictors and target
values of labeled examples, and $V(f_u(z), f_v(z))$ imposes the agreement among predictors on
unlabeled examples.

5.6 Co-clustering 

The co-training algorithm was originally designed for semi-supervised learning, but the idea
of co-training can also be applied in unsupervised and supervised learning settings. Under
the assumption that the true underlying clustering will assign corresponding points in
each view to the same cluster, several clustering techniques have been developed using the
multi-view approach. Bickel and Scheffer (2004) studied multi-view versions of the most
frequently used clustering approaches, such as k-means, k-medoids, and EM. Taking k-means
as an example: in each iteration, run k-means in one view, then pass the partition
information to the other view and run k-means in that view. After termination,
compute a consensus mean for each cluster and view, then assign each example to
one distinct cluster that is determined through the closest concept vector. Considering that
spectral clustering algorithms perform well on arbitrarily shaped clusters and have a
well-defined mathematical framework, some methods are designed to utilize the idea of
co-training to conduct spectral clustering (Kumar et al., 2010, 2011; Kumar and Daumé III,
2011). For example, Kumar and Daumé III (2011) developed a multi-view spectral clustering
algorithm which solves spectral clustering on individual graphs to obtain the discriminative
eigenvectors $U_1$ ($U_2$) in each view, then clusters points using $U_1$ ($U_2$) and uses this
clustering to modify the graph structure in view 2 (view 1). This process is repeated
for a number of iterations.
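
The alternating k-means procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses ordinary Euclidean k-means, a fixed number of rounds, and a simplified consensus step that averages over examples on which the two views agree. The names `two_view_kmeans`, `_centers` and `_sq_dists` and the `init_labels` argument are hypothetical.

```python
import numpy as np

def _centers(X, labels, k, rng):
    """Mean of each cluster; an empty cluster is re-seeded with a random point."""
    return np.stack([
        X[labels == c].mean(axis=0) if np.any(labels == c)
        else X[rng.integers(len(X))]
        for c in range(k)
    ])

def _sq_dists(X, centers):
    """Squared Euclidean distance from every row of X to every center."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)

def two_view_kmeans(X1, X2, k, n_iter=20, init_labels=None, seed=0):
    """Alternate k-means steps between two views, handing over the partition."""
    rng = np.random.default_rng(seed)
    views = (X1, X2)
    n = X1.shape[0]
    labels = (rng.integers(0, k, size=n) if init_labels is None
              else np.asarray(init_labels).copy())
    per_view = [labels.copy(), labels.copy()]
    for _ in range(n_iter):
        for v, X in enumerate(views):
            # M-step in this view, using the partition produced in the other view.
            centers = _centers(X, labels, k, rng)
            # E-step in this view: reassign every example to its nearest center.
            labels = _sq_dists(X, centers).argmin(axis=1)
            per_view[v] = labels
    # Consensus step: means are computed only from examples on which both
    # views agree; each example goes to the cluster closest over both views.
    agree = per_view[0] == per_view[1]
    consensus = np.where(agree, per_view[0], -1)
    total = np.zeros((n, k))
    for X in views:
        total += _sq_dists(X, _centers(X, consensus, k, rng))
    return total.argmin(axis=1)
```

Handing the partition from one view to the other is what couples the two clusterings; running k-means independently in each view would skip the `labels` hand-over inside the inner loop.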



5.7 Graph-based Co-training 

Most co-training style algorithms focus on how to minimize the disagreement between two
classifiers in order to obtain satisfactory performance of multi-view learners; these
methods can thus be seen as disagreement-based approaches. Graph-based methods for
co-training also exist; for instance, Yu et al. (2007, 2011) proposed a Bayesian undirected
graphical model for co-training through Gaussian processes (GPs). Suppose we have $m$ different
views of $n$ data examples $\{x_i\}$ with outputs $\{y_i\}$. Let $f_j$ denote the latent function for the
$j$-th view, and let $f_j \sim \mathcal{GP}(0, \kappa_j)$ be its GP prior in view $j$. A latent function $f_c$ is then
introduced to ensure conditional independence between the output $y$ and the $m$ latent
functions $f_j$ for the $m$ views. At the functional level, the output $y$ depends only on $f_c$, and
the latent functions $f_j$ depend on each other only via the consensus function $f_c$. That is, we
have the joint probability:

$$p(y, f_c, f_1, \cdots, f_m) = \frac{1}{Z}\, \psi(y, f_c) \prod_{j=1}^{m} \psi(f_j, f_c). \tag{11}$$

In the ground network with $n$ data examples, let $\mathbf{f}_c = \{f_c(x_i)\}_{i=1}^{n}$ and $\mathbf{f}_j = \{f_j(x_i^{(j)})\}_{i=1}^{n}$.
The graphical model leads to the following factorization:

$$p(\mathbf{y}, \mathbf{f}_c, \mathbf{f}_1, \cdots, \mathbf{f}_m) = \frac{1}{Z} \prod_{i=1}^{n} \psi(y_i, f_c(x_i)) \prod_{j=1}^{m} \psi(\mathbf{f}_j)\, \psi(\mathbf{f}_j, \mathbf{f}_c). \tag{12}$$

Here, the within-view potential $\psi(\mathbf{f}_j)$ specifies the dependency structure within each view
$j$, and the consensus potential $\psi(\mathbf{f}_j, \mathbf{f}_c)$ describes how the latent function in each view is
related to the consensus function $f_c$. Employing a GP prior for each of the views, we can
define the following potentials:

$$\psi(\mathbf{f}_j) = \exp\Big(-\frac{1}{2}\, \mathbf{f}_j^{T} K_j^{-1} \mathbf{f}_j\Big), \qquad \psi(\mathbf{f}_j, \mathbf{f}_c) = \exp\Big(-\frac{\|\mathbf{f}_j - \mathbf{f}_c\|^2}{2\sigma_j^2}\Big). \tag{13}$$

Integrating out all the $m$ latent functions in Eq. (12), we get the co-training kernel for
multi-view learning as

$$K_c = \Big[\sum_{j=1}^{m} \left(K_j + \sigma_j^2 I\right)^{-1}\Big]^{-1}. \tag{14}$$

This co-training kernel reveals a previously unclear insight into how the kernels from
different views are combined in multi-view learning and allows us to solve GP classification
simply.

Wang and Zhou (2010) treated the co-training process as a combinative propagation
over two views and unified the graph- and disagreement-based semi-supervised learning
into one framework. In one view, the labels can be propagated from the initial labeled
examples to unlabeled examples, and these newly labeled examples can be added into the
other view. The other view can then propagate the labels of the initial labeled examples
and these newly labeled examples to the remaining unlabeled instances. This process can
be repeated until the stopping condition is met.
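
The co-training kernel of Eq. (14) is simple to compute once the per-view kernel matrices are available. The sketch below is an illustration only: the helper names `rbf_kernel` and `co_training_kernel`, the RBF kernel choice, and the explicit matrix inverses (acceptable for small $n$) are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian RBF kernel matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def co_training_kernel(kernels, sigmas):
    """K_c = [ sum_j (K_j + sigma_j^2 I)^{-1} ]^{-1}, as in Eq. (14)."""
    n = kernels[0].shape[0]
    precision = np.zeros((n, n))
    for K, sigma in zip(kernels, sigmas):
        # Each view contributes the inverse of its noise-inflated kernel.
        precision += np.linalg.inv(K + sigma ** 2 * np.eye(n))
    return np.linalg.inv(precision)
```

With a single view the expression collapses to $K_1 + \sigma_1^2 I$, and a view with a very large $\sigma_j$ contributes almost nothing to the sum, so unreliable views are effectively down-weighted.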

5.8 Multi-learner Algorithms 

Goldman and Zhou (2000) presented a new "co-training" strategy for using unlabeled data
to improve the performance of standard supervised learning algorithms. Without assuming
that both of the views are sufficient for perfect classification, the only requirement of this
co-training strategy is that the hypothesis of each algorithm partitions the example space into
a set of equivalence classes. Assume that $A$ and $B$ are two different supervised algorithms, $U$ are unlabeled data,
$L$ are the original labeled data, $L_A$ are the data that $B$ labeled for $A$, and $L_B$ are the data
$A$ labeled for $B$. At the start of each iteration, train $A$ on the labeled examples $L \cup L_A$
to obtain the hypothesis $H_A$. Similarly, train $B$ on $L \cup L_B$ to obtain $H_B$. Each algorithm
considers each of its equivalence classes and decides which to use to label data from $U$ for
the other algorithm. This co-training algorithm repeats until neither $L_A$ nor $L_B$ changes
during an iteration.



Zhou and Li (2005b) proposed another co-training style semi-supervised algorithm called
tri-training, which does not require that the instance space be described with sufficient
and redundant views, nor does it put any constraints on the supervised algorithm, as does
the method of Goldman and Zhou (2000). Tri-training generates three classifiers from the
original labeled example set, which are then refined using unlabeled examples in the iterations. In each
iteration, an unlabeled example is labeled for a classifier if the other two classifiers agree
on the labeling, under certain conditions.
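
The core loop of tri-training can be sketched as below. This is a simplified illustration, not the published algorithm: it omits the error-rate conditions that decide whether pseudo-labeled examples are accepted, runs a fixed number of rounds, and uses a tiny nearest-centroid classifier as a stand-in base learner. The names `NearestCentroid` and `tri_training` are hypothetical.

```python
import numpy as np

class NearestCentroid:
    """Minimal base learner: predict the class of the closest class mean."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=-1)
        return self.classes_[d.argmin(axis=1)]

def tri_training(X_lab, y_lab, X_unl, n_rounds=5, seed=0):
    """Return a majority-vote predictor built from three co-refined classifiers."""
    rng = np.random.default_rng(seed)
    n = len(y_lab)
    # Three initial classifiers, each trained on a bootstrap sample of the labeled set.
    models = []
    for _ in range(3):
        idx = rng.integers(0, n, size=n)
        models.append(NearestCentroid().fit(X_lab[idx], y_lab[idx]))
    for _ in range(n_rounds):
        preds = [m.predict(X_unl) for m in models]
        refined = []
        for i in range(3):
            j, k = [t for t in range(3) if t != i]
            # An unlabeled example is labeled for classifier i when the
            # other two classifiers agree on its label.
            agree = preds[j] == preds[k]
            X_aug = np.vstack([X_lab, X_unl[agree]])
            y_aug = np.concatenate([y_lab, preds[j][agree]])
            refined.append(NearestCentroid().fit(X_aug, y_aug))
        models = refined

    def predict(X):
        votes = [m.predict(X) for m in models]
        # If classifiers 1 and 2 agree they form a majority; otherwise
        # classifier 0 either sides with one of them or breaks the tie.
        return np.where(votes[1] == votes[2], votes[1], votes[0])

    return predict
```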

The performance of traditional SVM-based relevance feedback approaches is often poor
when the number of labeled feedback samples is small; Li et al. (2006) therefore developed a
new machine learning technique, namely multi-training SVM (MTSVM), to mitigate this
problem. MTSVM combines the merits of the co-training technique and a random sampling
method in the feature space. However, simply using the co-training algorithm with SVM
is not realistic, because the co-training algorithm requires that the initial sub-classifiers
have good generalization ability before the co-training procedure commences. Thus the
authors employed classifier committee learning to enhance the generalization ability of each
sub-classifier. Initially, a series of feature subsets (in other words, multiple views of the
data) is obtained from the original input features using the random subspace method.
Multiple classifiers can then be learned on these generated views and can train one another
in a semi-supervised relevance feedback setting. Finally, the majority voting rule is used to
generate the optimal classifier.

6. Multiple Kernel Learning 

Multiple Kernel Learning (MKL) was originally developed to control the search space capacity
of possible kernel matrices to achieve good generalization, but it has been widely applied to
problems involving multi-view data. This is because kernels in MKL naturally correspond
to different views and combining kernels appropriately may improve learning performance.
Gönen and Alpaydın (2011) have reviewed the literature on MKL. Since MKL can be regarded
as just one part of multi-view learning, we place more weight on the connections
between MKL and the other parts; in this section, we illustrate the representative MKL
algorithms and theoretical studies to present a complete picture in this survey.

6.1 Boosting Methods 

Inspired by ensemble and boosting methods (Duffy and Helmbold, 2000; Friedman et al.,
2001), Bennett et al. (2002) proposed the Multiple Additive Regression Kernels (MARK)
algorithm, which considers a large library of kernel matrices formed by different kernel
functions and parameters. The decision function is modified as

$$f(x) = \sum_{i=1}^{N} \sum_{k=1}^{M} \alpha_i^k K_k(x_i, x) + b, \tag{15}$$

which is composed of a linear combination of heterogeneous kernel functions $K_1, \cdots, K_M$,
and each kernel can be of any type; for example, $\{K_k\}$ could be RBF kernels with different
parameters. Like ensemble methods, each column of the kernel is treated as a hypothesis
and the kernel columns are generated on the fly. Gradient-based ensemble algorithms, such
as gradient boosting, can be adapted to this optimization problem.

Column Generation (CG) techniques have been widely used for solving large scale linear
programs (LPs). Bi et al. (2004) used the 2-norm regularization approach to extend LPBoost
to a quadratic program (QP), so that many successful formulations, such as classic
SVMs, ridge regression, etc., could benefit from CG techniques.



Crammer et al. (2002) used the boosting paradigm to perform the kernel construction
process. Since numerous interpretations of AdaBoost and its variants regard the boosting
process as a procedure that attempts to minimize classification error, the boosting
methodology can be modified to work with kernels by rewriting the loss functions for a pair of
examples $(x_1, y_1)$ and $(x_2, y_2)$ as

$$\mathrm{ExpLoss}(K(x_1, x_2), y_1 y_2) = \exp(-y_1 y_2 K(x_1, x_2)),$$
$$\mathrm{LogLoss}(K(x_1, x_2), y_1 y_2) = \log(1 + \exp(-y_1 y_2 K(x_1, x_2))).$$

A pair of instances is viewed as a single example and pairs of the same labels are regarded
as positively labeled examples, while pairs of opposite labels are seen as negatively labeled
examples. Along similar lines to boosting algorithms for classification, the combined kernel
matrix can be updated iteratively using one of these two loss functions.
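
To make the pairwise view concrete, the short sketch below evaluates both losses for every pair of examples at once, given a kernel matrix and labels in $\{-1, +1\}$. The function name `pair_losses` is an assumption made for this example.

```python
import numpy as np

def pair_losses(K, y):
    """ExpLoss and LogLoss for all pairs, given kernel matrix K and labels y in {-1, +1}."""
    # y_i * y_j is +1 for same-label pairs and -1 for opposite-label pairs.
    pair_label = np.outer(y, y)
    margin = pair_label * K
    exp_loss = np.exp(-margin)
    log_loss = np.log1p(np.exp(-margin))
    return exp_loss, log_loss
```

Both losses are small when the kernel value is large and positive for same-label pairs and large and negative for opposite-label pairs, which is the sense in which a pair of instances acts as a single labeled example.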

6.2 Semi-Definite Programming 

The general form of Semi-Definite Programming (SDP) is

$$\begin{aligned}
\min_{x} \quad & c^{T} x \\
\text{s.t.} \quad & F(x) = F_0 + x_1 F_1 + \cdots + x_n F_n \succeq 0, \\
& Ax = b,
\end{aligned} \tag{16}$$

where $x \in \mathbb{R}^{n}$ and $F_i = F_i^{T} \in \mathbb{R}^{p \times p}$. Note that the objective is linear in the unknown $x$, and
that both inequality and equality constraints are linear in $x$.

Lanckriet et al. (2002, 2004) showed how the kernel matrix can be learned from data
via SDP techniques. In particular, if all labels of the data are known, the task is to find the
kernel matrix $K$ which is maximally aligned with the set of labels $y$, and this problem
is formulated as

$$\begin{aligned}
\max_{K} \quad & \langle K, yy^{T} \rangle \\
\text{s.t.} \quad & \mathrm{trace}(K) \le 1, \\
& K \succeq 0.
\end{aligned} \tag{17}$$

Given the labeled training set $S_{n_{tr}} = \{(x_1, y_1), \cdots, (x_{n_{tr}}, y_{n_{tr}})\}$ and the unlabeled test set
$T_{n_t} = \{x_{n_{tr}+1}, \cdots, x_{n_{tr}+n_t}\}$, we consider a kernel matrix of the form

$$K = \begin{pmatrix} K_{tr} & K_{tr,t} \\ K_{tr,t}^{T} & K_{t} \end{pmatrix}, \tag{18}$$

where $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$, $i, j = 1, \cdots, n_{tr}, n_{tr}+1, \cdots, n_{tr}+n_t$. The goal is then to learn
the optimal mixed block $K_{tr,t}$ and the optimal "test data block" $K_t$ by optimizing a cost
function over the "training data block" $K_{tr}$. Under the constraint $K = \sum_{i=1}^{m} \mu_i K_i$, where
the set $\mathcal{K} = \{K_1, \cdots, K_m\}$ is given and the weights $\mu_i$ are to be optimized, we can replace $K$ with $K_{tr}$
in Eq. (17) and obtain the SDP formulation for learning the kernel matrix.
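
A small special case of the alignment problem in Eq. (17) can be solved without an SDP solver. If the weights are additionally restricted to be non-negative (an assumption made for this sketch, not a requirement of the general formulation), then any combination of positive semi-definite base kernels is itself positive semi-definite, and both the objective and the trace constraint are linear in the weights. The problem is then a linear program with a single budget constraint, whose optimum puts all weight on the kernel with the best alignment-to-trace ratio. The function name `alignment_weights` is hypothetical.

```python
import numpy as np

def alignment_weights(kernels, y):
    """Maximize <sum_i mu_i K_i, y y^T> s.t. trace(sum_i mu_i K_i) <= 1 and mu >= 0."""
    # <K_i, y y^T> = y^T K_i y, the (unnormalized) alignment of each base kernel.
    gains = np.array([y @ K @ y for K in kernels])
    # trace(K_i) is the share of the trace budget used per unit weight.
    costs = np.array([np.trace(K) for K in kernels])
    mu = np.zeros(len(kernels))
    best = int(np.argmax(gains / costs))
    if gains[best] > 0:
        # Spend the whole trace budget on the kernel with the best ratio.
        mu[best] = 1.0 / costs[best]
    return mu
```

Allowing negative weights, or adding further constraints such as a margin criterion, breaks this shortcut; in that setting the positive semi-definiteness constraint becomes active and the general SDP machinery is needed.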

6.3 Quadratically Constrained Quadratic Program (QCQP) 



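Bach et al. (2004) introduced a novel classification algorithm called the support kernel machine
(SKM). Given a decomposition of $\mathbb{R}^{k}$ as a product of $m$ blocks, $\mathbb{R}^{k} = \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_m}$,
each data point $x_i$ can be decomposed into $m$ block components, $x_i = (x_{1i}, \cdots, x_{mi})$. The aim is
to find a linear classifier, $y = \mathrm{sign}(w^{T} x + b)$, where $w = (w_1, \cdots, w_m)$. To obtain sparsity
in the vector $w$ and make most of the components in $w$ zero, the 1-norm and 2-norm are
used together to penalize $w$. The primal problem can thus be formulated as follows:

$$\begin{aligned}
\min \quad & \frac{1}{2}\Big(\sum_{j=1}^{m} d_j \|w_j\|_2\Big)^2 + C \sum_{i=1}^{n} \xi_i \\
\text{w.r.t.} \quad & w \in \mathbb{R}^{k} = \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_m},\ \xi \in \mathbb{R}_{+}^{n},\ b \in \mathbb{R} \\
\text{s.t.} \quad & y_i\Big(\sum_{j=1}^{m} w_j^{T} x_{ji} + b\Big) \ge 1 - \xi_i,\ \forall i \in \{1, \cdots, n\}.
\end{aligned} \tag{19}$$

This optimization problem can be seen as a second order cone program (SOCP),
and the dual problem is given by:

$$\begin{aligned}
\min \quad & \frac{1}{2}\gamma^2 - \alpha^{T} e \\
\text{w.r.t.} \quad & \gamma \in \mathbb{R},\ \alpha \in \mathbb{R}^{n} \\
\text{s.t.} \quad & 0 \le \alpha \le C,\ \alpha^{T} y = 0, \\
& \Big\|\sum_{i=1}^{n} \alpha_i y_i x_{ji}\Big\|_2 \le d_j \gamma,\ \forall j \in \{1, \cdots, m\},
\end{aligned} \tag{20}$$

which is exactly equivalent to the QCQP formulation of Lanckriet et al. (2004). The
advantage of this SOCP formulation, however, is that Bach et al. (2004) were able to develop an SMO
algorithm for the SKM using Moreau-Yosida regularization, which transforms the primal
problem into:

$$\begin{aligned}
\min \quad & \frac{1}{2}\Big(\sum_{j=1}^{m} d_j \|w_j\|_2\Big)^2 + \frac{1}{2}\sum_{j=1}^{m} a_j^2 \|w_j\|_2^2 + C \sum_{i=1}^{n} \xi_i \\
\text{w.r.t.} \quad & w \in \mathbb{R}^{k} = \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_m},\ \xi \in \mathbb{R}_{+}^{n},\ b \in \mathbb{R} \\
\text{s.t.} \quad & y_i\Big(\sum_{j=1}^{m} w_j^{T} x_{ji} + b\Big) \ge 1 - \xi_i,\ \forall i \in \{1, \cdots, n\},
\end{aligned} \tag{21}$$

where $\{a_j\}$ are the MY-regularization parameters.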



6.4 Semi-infinite Linear Program (SILP) 

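Sonnenburg et al. (2006a,b) followed a different direction and formulated the problem as
a semi-infinite linear program (SILP). Beginning with Eq. (20), the equivalent multiple
kernel learning dual is modified as:

$$\begin{aligned}
\min \quad & \gamma \\
\text{w.r.t.} \quad & \gamma \in \mathbb{R},\ \alpha \in \mathbb{R}^{N} \\
\text{s.t.} \quad & 0 \le \alpha \le C,\ \alpha^{T} y = 0, \\
& \underbrace{\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k_k(x_i, x_j) - \sum_{i=1}^{N} \alpha_i}_{=:\ S_k(\alpha)} \le \gamma,\ \forall k \in \{1, \cdots, M\},
\end{aligned} \tag{22}$$

which may be solved through the Lagrangian

$$L = \gamma + \sum_{k=1}^{M} \beta_k \left(S_k(\alpha) - \gamma\right), \tag{23}$$

minimized w.r.t. $\alpha$ and maximized w.r.t. $\beta$. Setting the derivative w.r.t. $\gamma$ to zero, the
constraint $\sum_k \beta_k = 1$ is obtained. Eq. (23) can then be simplified to a min-max problem

$$\begin{aligned}
\max_{\beta} \min_{\alpha} \quad & \sum_{k=1}^{M} \beta_k S_k(\alpha) \\
\text{s.t.} \quad & 0 \le \alpha \le C,\ \sum_{i=1}^{N} \alpha_i y_i = 0,\ \beta \ge 0,\ \sum_{k=1}^{M} \beta_k = 1.
\end{aligned} \tag{24}$$

Assume that $\alpha^*$ is the optimal solution, and define $\theta = L = S(\alpha^*, \beta)$ with
$S(\alpha, \beta) = \sum_k \beta_k S_k(\alpha)$. Eq. (23) is then equivalent to the following SILP problem:

$$\begin{aligned}
\max \quad & \theta \\
\text{s.t.} \quad & \beta \ge 0,\ \sum_{k} \beta_k = 1,\ \sum_{k} \beta_k S_k(\alpha) \ge \theta \\
& \forall \alpha \ \text{with}\ 0 \le \alpha \le C,\ \sum_{i} y_i \alpha_i = 0,
\end{aligned} \tag{25}$$

where $\theta$ and $\beta$ are only linearly constrained, but there are infinitely many constraints,
one for each admissible value of $\alpha$.

Compared to the SDP and QCQP, the SILP formulation has a lower computational
complexity, and this SILP problem can be efficiently solved using an off-the-shelf LP solver
and a standard SVM implementation. Thus it allows us to efficiently handle more than a
hundred thousand examples or several hundred kernels.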

6.5 Simple MKL 



Rakotomamonjy et al. (2007, 2008) departed from the framework proposed by Bach et al.
(2004) and presented a different primal problem for multiple kernel learning through an
adaptive 2-norm regularization formulation. Inspired by the multiple smoothing splines
framework (Wahba, 1990), the proposed primal formulation is

$$\begin{aligned}
\min \quad & \frac{1}{2}\sum_{k} \frac{1}{d_k} \|w_k\|^2 + C \sum_{i} \xi_i \\
\text{s.t.} \quad & y_i\Big(\sum_{k} w_k^{T} x_i + b\Big) \ge 1 - \xi_i, \\
& \sum_{k} d_k = 1, \\
& \xi_i \ge 0,\ d_k \ge 0,\ \forall i, \forall k.
\end{aligned} \tag{26}$$

Note that each $d_k$ controls the smoothness of the corresponding kernel function, and the 1-norm constraint on
the vector $d$ leads to a sparse decision function with few basis kernels. By defining the
optimal SVM objective value $J(d)$ as

$$\begin{aligned}
J(d) = \min \quad & \frac{1}{2}\sum_{k} \frac{1}{d_k} \|w_k\|^2 + C \sum_{i} \xi_i \\
\text{s.t.} \quad & y_i\Big(\sum_{k} w_k^{T} x_i + b\Big) \ge 1 - \xi_i,\ \xi_i \ge 0,\ \forall i,
\end{aligned} \tag{27}$$

the primal optimization problem can then be reformulated as

$$\min_{d} \; J(d) \quad \text{s.t.} \quad \sum_{k} d_k = 1,\ d_k \ge 0. \tag{28}$$

The overall procedure to solve this problem consists of two steps: first, solving a canonical
SVM optimization problem $J(d)$ with given $d$; second, updating $d$ by gradient descent
while ensuring that the constraints on $d$ are satisfied. This novel multiple kernel learning
framework is called simple MKL, which has been shown to be more efficient than the SILP
approach.
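
The two-step alternation (solve the inner problem for fixed weights, then take a gradient step on the weights) can be illustrated with a deliberately simplified stand-in. In the sketch below the inner SVM is replaced by kernel ridge regression, so that the inner problem has the closed form $J(d) = \frac{\lambda}{2}\, y^{T}(K_d + \lambda I)^{-1} y$ with $K_d = \sum_k d_k K_k$ and gradient $\partial J / \partial d_k = -\frac{\lambda}{2}\, \alpha^{T} K_k \alpha$, where $\alpha = (K_d + \lambda I)^{-1} y$. The weight update is a projected gradient step with backtracking rather than the reduced gradient used by simple MKL. These substitutions, and the names `project_simplex`, `inner_objective` and `simple_mkl_sketch`, are assumptions made for the example.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {d : d >= 0, sum(d) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def inner_objective(kernels, d, y, lam):
    """Step 1: solve the inner problem for fixed d; return J(d) and its gradient."""
    K = sum(dk * Kk for dk, Kk in zip(d, kernels))
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    J = 0.5 * lam * (y @ alpha)
    grad = np.array([-0.5 * lam * (alpha @ Kk @ alpha) for Kk in kernels])
    return J, grad

def simple_mkl_sketch(kernels, y, lam=1.0, n_iter=50, step=1.0):
    """Alternate between the inner solve and a constrained update of d."""
    d = np.full(len(kernels), 1.0 / len(kernels))
    J, grad = inner_objective(kernels, d, y, lam)
    for _ in range(n_iter):
        t = step
        while t > 1e-10:
            # Step 2: gradient step on d, projected back onto the simplex.
            d_new = project_simplex(d - t * grad)
            J_new, grad_new = inner_objective(kernels, d_new, y, lam)
            if J_new <= J:
                break
            t *= 0.5  # backtrack until the objective does not increase
        else:
            break  # no acceptable step size was found
        d, J, grad = d_new, J_new, grad_new
    return d, J
```

The gradient has the same structure as in the SVM case: each component measures how much the current dual solution "uses" kernel $k$, so weight moves toward the kernels that explain the labels best.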

Chapelle and Rakotomamonjy (2008) investigated the use of second order optimization
approaches to solve the MKL problem, and proposed Hessian MKL as an extension of simple
MKL. In each iteration, Hessian MKL updates the kernel weights using a Newton step found
by solving a QP problem. The results show that Hessian MKL outperforms simple MKL
in terms of computational efficiency.

The SILP approach often suffers from slow convergence because it updates kernel weights
based only on the cutting plane model. Simple MKL is efficient; however, it does not
use the gradients computed in previous iterations, which can be useful in improving the
efficiency of the search. Xu et al. (2009a) extended the level method and applied it to
multiple kernel learning to overcome the drawbacks of SILP (Sonnenburg et al., 2006b) and
simple MKL (Rakotomamonjy et al., 2007). Following the SILP method, this algorithm has
an extra step to adjust the solution for kernel weights obtained from a cutting plane model,
through a projection to a level set. This adjustment ensures the new solution is close to the
current solution and reduces the objective function.



6.6 Group-LASSO Approaches 

It is reasonable to consider the group structure between the combined kernels when the
kernels can be partitioned into groups which correspond to subsets of inputs or sources.
In the learning process, it is desirable to suppress the kernels or groups that are irrelevant
for the classification task, while the kernels belonging to groups which
are relevant to the task are selected. Based on this idea, Szafranski et al. (2008, 2010)
developed the Composite Kernel Learning (CKL) approach, which extends the multiple
kernel learning problem to take into account the group structure among kernels and
constructs the relationship with group-LASSO (Yuan and Lin, 2006). The MKL formulation
of Rakotomamonjy et al. (2008) is modified to obtain the following formulation of CKL:



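[Eq. (29): the CKL optimization problem, a minimization whose third constraint is bounded above by 1; the formula is not recoverable from the source text.]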



where $p$ and $q$ are set according to the problem at hand, $G$ denotes one subset of kernels, and
$\|G\|$ is the size of group $G$. Note the third constraint: in the particular case where $p = 0, q = 1$,
a LASSO type penalty is imposed on the RKHS norms, and when $p = 1, q = 0$, a group-LASSO
type penalty is imposed on the RKHS norms.

Xu et al. (2010) discussed the connection between multiple kernel learning and the
group-LASSO regularizer, and generalized MKL to $L_p$-MKL, which constrains the $p$-norm of the
kernel weights. The proposed algorithm provides a unified solution for the entire family
of $L_p$ models; in addition, the kernel weights can be calculated by a closed-form
formulation without dependence on other commercial software. Subrahmanya and Shin (2010)
proposed an algorithm called Sparse Multiple Kernel Learning (SMKL), which generalizes
group-feature selection to kernel selection by introducing a log-based penalty over the
groups. This method can automatically select the optimal number of sources from a large
candidate list with a sparser solution compared to the existing multiple kernel learning
framework.

6.7 Bounds for Learning Kernels 

The most common family of kernels examined in multiple kernel learning is that of
non-negative or convex combinations of some fixed kernels constrained by a trace condition,
which can be viewed as an $L_1$ or $L_2$ regularization, or $L_p$ regularization with other values
of $p$.

Lanckriet et al. (2004) showed that when a kernel is chosen from a convex combination of
$k$ base kernels, the estimation error of the learned classifier is bounded by $O(\sqrt{(k/\gamma^2)/n})$, where
$\gamma$ is the margin of the learned classifier under the kernel. This bound converges and can be
viewed as the first informative generalization bound for this family of kernels; however, the
multiplicative interaction between the margin complexity term $1/\gamma^2$ and the number of base
kernels $k$ does not encourage the use of too many base kernels. It suggests that learning
even a few kernel parameters leads to a multiplicative increase in the required sample size.
Srebro and Ben-David (2006) presented a generalization bound for a kernel family with
pseudo-dimension $d_\phi$. The pseudo-dimension of most kernel families is similar to our
intuitive notion of the dimensionality of the family; in particular, the pseudo-dimension of
a family of linear or convex combinations of $k$ base kernels is at most $k$. The estimation
error for SVMs with margin $\gamma$ is bounded by $\sqrt{\tilde{O}(d_\phi + 1/\gamma^2)/n}$, which establishes that the
bound on the required sample size, $\tilde{O}(d_\phi + 1/\gamma^2)$, grows only additively with the dimensionality
of the allowed kernel family. Ying and Campbell (2009) showed that the generalization
analysis of the regularized kernel learning system reduces to investigation of the suprema
of the Rademacher chaos process of order two over candidate kernels, and they used metric
entropy integrals and the pseudo-dimension of the set of candidate kernels to estimate the
empirical Rademacher chaos complexity. For a pseudo-dimension of $k$, as in the case of a
convex combination of $k$ base kernels, their bound is in $O(\sqrt{k (R^2/\rho^2)(\log(m)/m)})$ and is
thus multiplicative in $k$. Based on a combinatorial analysis of the Rademacher complexity of
the hypothesis set under consideration, Cortes et al. (2010) presented another generalization
bound with an $L_1$ constraint that has only a logarithmic dependency on the kernel number
$k$. The bound is in $O(\sqrt{(\log k)(R^2/\rho^2)/m})$, thus it is valid for a very large number of kernels, in
particular for $k \gg m$, and it contains only a $\sqrt{\log k}$ dependency on the number of kernels,
which is tight and considerably more favorable. Assuming the different views corresponding
to the different kernels to be uncorrelated, Kloft and Blanchard (2011) derived an upper
bound on the local Rademacher complexity of $L_p$-norm multiple kernel learning. Given the
number of kernels $M$ and the radius $D$, the bound for centered identical independent kernels
is of the order $O\big(\sqrt{\sum_{j=1}^{\infty} \min(rM, D^2 M^{2/p^*} \lambda_j)}\big)$. From the upper bound, a tighter excess risk
bound than previous approaches is obtained, which achieves a fast convergence rate of the
order $O(n^{-\frac{\alpha}{1+\alpha}})$, where $\alpha$ is the minimum eigenvalue decay rate of the individual kernels.

7. Subspace Learning-based Approaches 

Subspace learning-based approaches aim to obtain a latent subspace shared by multiple 
views by assuming that the input views are generated from this subspace. Besides the well 
known canonical correlation analysis (CCA), other more effective methods to construct the 
subspaces have recently become available. 

7.1 Algorithms based on CCA 

Canonical correlation analysis (CCA) is a technique for modeling the relationships between 
two (or more) sets of variables, and it has been applied with great success on a variety of 
learning problems dealing with multi-view data. 

7.1.1 A Review of CCA

For $X \in \mathbb{R}^{D_1 \times N}$ and $Y \in \mathbb{R}^{D_2 \times N}$, CCA computes two projection vectors, $w_x \in \mathbb{R}^{D_1}$ and
$w_y \in \mathbb{R}^{D_2}$, such that the following correlation coefficient:

$$\rho = \frac{w_x^{T} X Y^{T} w_y}{\sqrt{(w_x^{T} X X^{T} w_x)(w_y^{T} Y Y^{T} w_y)}} \tag{30}$$

is maximized. Since $\rho$ is invariant to the scaling of $w_x$ and $w_y$, CCA can be formulated
equivalently as

$$\begin{aligned}
\max_{w_x, w_y} \quad & w_x^{T} X Y^{T} w_y \\
\text{s.t.} \quad & w_x^{T} X X^{T} w_x = 1, \quad w_y^{T} Y Y^{T} w_y = 1.
\end{aligned} \tag{31}$$

Assuming $YY^{T}$ is nonsingular, $w_x$ can be obtained by solving the following optimization
problem:

$$\begin{aligned}
\max_{w_x} \quad & w_x^{T} X Y^{T} (Y Y^{T})^{-1} Y X^{T} w_x \\
\text{s.t.} \quad & w_x^{T} X X^{T} w_x = 1.
\end{aligned} \tag{32}$$

Both formulations in Eqs. (31) and (32) attempt to find the eigenvectors corresponding
to the top eigenvalues of the following generalized eigenvalue problem:

$$X Y^{T} (Y Y^{T})^{-1} Y X^{T} w_x = \eta\, X X^{T} w_x, \tag{33}$$

where $\eta$ is the eigenvalue corresponding to the eigenvector $w_x$.
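
The generalized eigenvalue problem of Eq. (33) can be solved with standard dense linear algebra. The sketch below is one possible way to do it and makes several assumptions not stated in the text: samples are stored as columns, both views are centered inside the function, a small ridge term `reg` is added to both covariance matrices for numerical stability, and the problem is reduced to a symmetric eigenproblem via a Cholesky factor of $XX^{T}$. The function name `linear_cca` is hypothetical.

```python
import numpy as np

def linear_cca(X, Y, reg=1e-8):
    """Top canonical pair for X (D1 x N) and Y (D2 x N), samples as columns."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    Cxx = X @ X.T + reg * np.eye(X.shape[0])
    Cyy = Y @ Y.T + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T
    # Left-hand side of Eq. (33): X Y^T (Y Y^T)^{-1} Y X^T.
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    # Whiten with Cxx = L L^T so the generalized problem becomes symmetric.
    L_inv = np.linalg.inv(np.linalg.cholesky(Cxx))
    S = L_inv @ M @ L_inv.T
    eta, V = np.linalg.eigh((S + S.T) / 2.0)
    wx = L_inv.T @ V[:, -1]            # satisfies wx^T Cxx wx = 1
    wy = np.linalg.solve(Cyy, Cxy.T @ wx)
    wy /= np.sqrt(wy @ Cyy @ wy)       # satisfies wy^T Cyy wy = 1
    rho = wx @ Cxy @ wy                # canonical correlation, equal to sqrt(eta)
    return wx, wy, rho, eta[-1]
```

The largest eigenvalue $\eta$ equals the squared canonical correlation, which is why maximizing Eq. (32) and maximizing Eq. (31) lead to the same projection $w_x$.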

7.1.2 Kernel CCA 

Canonical correlation analysis (CCA) is a linear feature extraction algorithm, but for many
real world datasets exhibiting non-linearities, it is impossible for a linear projection to
capture the properties of the data. Kernel methods provide a way to deal with the
non-linearities by mapping the data to a higher dimensional space and then applying linear
methods in that space.

Formally, given a pair of datasets $X \in \mathbb{R}^{D_1 \times N}$ and $Y \in \mathbb{R}^{D_2 \times N}$, CCA seeks to find
linear projections $w_x \in \mathbb{R}^{D_1}$ and $w_y \in \mathbb{R}^{D_2}$ such that, after projecting, the corresponding
examples in the two datasets are maximally correlated in the projected space. To obtain
the kernel formulation of CCA, a dual representation is used, expressing the projection
directions as $w_x = X\alpha$ and $w_y = Y\beta$, where $\alpha$ and $\beta$ are vectors of size $N$. In the dual
formulation, the correlation coefficient between $X$ and $Y$ can be written as:

$$\rho = \max_{\alpha, \beta} \frac{\alpha^{T} X^{T} X Y^{T} Y \beta}{\sqrt{\alpha^{T} X^{T} X X^{T} X \alpha \times \beta^{T} Y^{T} Y Y^{T} Y \beta}}. \tag{34}$$

Now using the fact that $K_x = X^{T} X$ and $K_y = Y^{T} Y$ are the kernel matrices for $X$ and
$Y$, kernel CCA amounts to solving the following problem:

$$\begin{aligned}
\max_{\alpha, \beta} \quad & \alpha^{T} K_x K_y \beta \\
\text{s.t.} \quad & \alpha^{T} K_x^2 \alpha = 1, \quad \beta^{T} K_y^2 \beta = 1.
\end{aligned} \tag{35}$$

KCCA works by using the kernel matrices $K_x$ and $K_y$ of the examples in the two views
$X$ and $Y$ of the data. In contrast to linear CCA, which works by carrying out an
eigen-decomposition of the covariance matrix, the eigenvalue problem for KCCA follows from the
stationarity conditions of Eq. (35) and is given by:

$$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}. \tag{36}$$

For the case of a linear kernel, KCCA reduces to the standard CCA.
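
The reduction to standard CCA for linear kernels can be checked numerically. Substituting $a = K_x \alpha$ and $b = K_y \beta$ turns Eq. (35) into the maximization of $a^{T} b$ over unit vectors $a$ and $b$ lying in the ranges of $K_x$ and $K_y$, so the optimal value is the largest singular value of $Q_x^{T} Q_y$, where $Q_x$ and $Q_y$ are orthonormal bases of those ranges. The sketch below uses this observation; it is an illustration with no regularization, and the function names `range_basis` and `kcca_top_correlation` and the rank tolerance are assumptions made for the example.

```python
import numpy as np

def range_basis(K, tol=1e-10):
    """Orthonormal basis of the range of a symmetric PSD kernel matrix."""
    evals, evecs = np.linalg.eigh(K)
    return evecs[:, evals > tol * evals.max()]

def kcca_top_correlation(Kx, Ky):
    """Optimal value of Eq. (35): cosine of the smallest principal angle
    between the ranges of the two kernel matrices."""
    Qx, Qy = range_basis(Kx), range_basis(Ky)
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0]
```

With centered data and linear kernels $K_x = X^{T} X$ and $K_y = Y^{T} Y$, the returned value matches the top canonical correlation of linear CCA. When a kernel matrix is invertible its range is all of $\mathbb{R}^{N}$, and the unregularized problem trivially attains a correlation of 1, which is why practical KCCA implementations add a regularization term.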

KCCA can isolate feature space directions that correlate between the two views and
might be expected to represent common relevant information; accordingly, experiments have
shown that KCCA can be an effective preprocessing step to improve the performance of
classification algorithms such as the Support Vector Machine (SVM). Combining KCCA with
SVM into a single optimization, Farquhar et al. (2005) proposed a method called SVM-2K,
which can be seen as the global optimization of two distinct SVMs, one in each of the two
feature spaces. Slightly different from the 2-norm that characterizes KCCA, SVM-2K takes
an $\epsilon$-insensitive 1-norm, using slack variables to measure the amount by which points fail to
meet $\epsilon$ similarity:

$$|\langle w_A, \phi_A(x_i) \rangle + b_A - \langle w_B, \phi_B(x_i) \rangle - b_B| \le \eta_i + \epsilon,$$

where $w_A, b_A$ and $w_B, b_B$ are the weight and bias of the first and second SVM respectively.
Then with the usual 1-norm SVM constraints, the optimization problem can be written as:

$$\begin{aligned}
\min \quad & L = \frac{1}{2}\|w_A\|^2 + \frac{1}{2}\|w_B\|^2 + C^A \sum_{i} \xi_i^A + C^B \sum_{i} \xi_i^B + D \sum_{i} \eta_i \\
\text{s.t.} \quad & |\langle w_A, \phi_A(x_i) \rangle + b_A - \langle w_B, \phi_B(x_i) \rangle - b_B| \le \eta_i + \epsilon, \\
& y_i(\langle w_A, \phi_A(x_i) \rangle + b_A) \ge 1 - \xi_i^A, \\
& y_i(\langle w_B, \phi_B(x_i) \rangle + b_B) \ge 1 - \xi_i^B.
\end{aligned} \tag{37}$$

The final decision function is

$$f(x) = \frac{1}{2}\left(f_A(x) + f_B(x)\right). \tag{38}$$



7.1.3 Theoretical analysis of CCA 

Canonical correlation analysis (CCA) can be viewed as finding basis vectors for two sets
of variables such that the correlations between the projections onto these basis vectors,
$x_a = w_a^{T} \phi_a(x)$ and $y_b = w_b^{T} \phi_b(y)$, are mutually maximized. KCCA uses the kernel trick to
produce a non-linear version of CCA, by looking for functions $f \in \mathcal{H}_x$ and $g \in \mathcal{H}_y$ such that
the random variables $f(x)$ and $g(y)$ have maximal correlation. This leads to the kernelised
form, KCCA:

$$\max_{f \in \mathcal{H}_x,\, g \in \mathcal{H}_y} \frac{\mathrm{Cov}[f(x), g(y)]}{\mathrm{Var}[f(x)]^{1/2}\, \mathrm{Var}[g(y)]^{1/2}}. \tag{39}$$

In practice, we have to estimate the desired function from a finite sample, thus an empirical
estimate of Eq. (39) is

$$\max_{f \in \mathcal{H}_x,\, g \in \mathcal{H}_y} \frac{\widehat{\mathrm{Cov}}[f(x), g(y)]}{\big(\widehat{\mathrm{Var}}[f(x)] + \epsilon_n \|f\|_{\mathcal{H}_x}^2\big)^{1/2} \big(\widehat{\mathrm{Var}}[g(y)] + \epsilon_n \|g\|_{\mathcal{H}_y}^2\big)^{1/2}}, \tag{40}$$

where $\epsilon_n$ is the regularization coefficient and $n$ is the number of examples. Fukumizu et al.
(2007) investigated the general problem of establishing the consistency of KCCA by providing
the rates for the regularization parameter, and proved that when

$$\lim_{n \to \infty} \epsilon_n = 0, \qquad \lim_{n \to \infty} \frac{n^{-1/3}}{\epsilon_n} = 0,$$

for the decay of the regularization coefficient $\epsilon_n$, the convergence in the $L_2$ norm for kernel
CCA is ensured.

Hardoon and Shawe-Taylor (2003) presented a finite sample statistical analysis of KCCA
by using a regression formulation. By computing the empirical expected value of $g_{a,b}(x, y) :=
E[\|W_a^{T} \phi_a(x) - W_b^{T} \phi_b(y)\|^2]$, the error bound on new data can be obtained by using Rademacher
complexity. Formally, given a paired training set $S = \{(x_i, y_i)\}$ of size $\ell$ in the feature space
defined by the bounded kernels $k_a$ and $k_b$, drawn i.i.d. according to a distribution $\mathcal{D}$, then
with probability greater than $1 - \delta$ over the generation of $S$, the expected value of $g_{a,b}(x, y)$
on new data is bounded by

$$E_{\mathcal{D}}[g_{a,b}] \le \hat{E}[g_{a,b}] + 3RA\sqrt{\frac{\ln(2/\delta)}{2\ell}} + \frac{4A}{\ell}\sqrt{\sum_{i=1}^{\ell} \big(k_a(x_i, x_i) + k_b(y_i, y_i)\big)^2}, \tag{41}$$

where

$$R = \max_{x, y}\big(k_a(x, x) + k_b(y, y)\big), \qquad \|W_a^{T} W_a + W_b^{T} W_b\|_F \le A.$$

This suggests the regularization of KCCA because it shows that the quality of the
generalization of the associated pattern function is controlled by the sum of the squares of the
norms of the weight vectors.

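Cai and Sun (2011) gave a convergence rate analysis of kernel CCA. Assume that $(\mathcal{H}_x, \mathcal{H}_y)$
are RKHSs of functions on $\mathcal{X}$ and $\mathcal{Y}$ respectively, that $V_{yx}$ is a compact operator from $\mathcal{H}_x$ to
$\mathcal{H}_y$, and that there exist operators $W_l, W_r$ such that

$$V_{yx} = W_l \Sigma_{xx}^{p} \quad \text{and} \quad V_{yx} = \Sigma_{yy}^{p} W_r,$$

where $\Sigma_{xx}$ and $\Sigma_{yy}$ are covariance operators. Taking $\epsilon_n = \epsilon_1 n^{-\alpha}$ with $0 < \alpha < 1/3$, then
with probability at least $1 - \delta$, we have

$$\|\Sigma_{xx}^{1/2}(\hat{f}_n - f)\|_{\mathcal{H}_x}^2 \le C_{\epsilon_1, \delta}\, n^{-\theta}, \qquad \|\Sigma_{yy}^{1/2}(\hat{g}_n - g)\|_{\mathcal{H}_y}^2 \le C_{\epsilon_1, \delta}\, n^{-\theta}, \tag{42}$$

where $\theta = \min\{1 - 3\alpha, 2p\alpha, \alpha\}$ and $C_{\epsilon_1, \delta}$ is a constant independent of $n$. So when $0 < p \le \frac{1}{2}$,
the convergence rate is $\min\{1 - 3\alpha, 2p\alpha, \alpha\}$.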

7.1.4 Related algorithms with CCA 

CCA has been widely studied in different fields as a general tool for conducting multi-view
dimensionality reduction. Recently many new algorithms based on CCA have been proposed
to extend the original CCA in different applications.

One popular use of CCA is for supervised learning, in which one view is derived from
the data and another view is derived from the class labels. In this setting, the data can be
projected into a lower-dimensional space directed by the label information (Yu et al., 2006).
However, this algorithm does not actually use multiple views of the data; it is just a
single view approach along with the label information. Sharma et al. (2012) proposed
Generalized Multi-view Analysis (GMA), which exploits the fact that most popular
supervised and unsupervised feature extraction techniques are the solution of a special form of
quadratically constrained quadratic program. This algorithm can be seen as a supervised
extension of CCA and has the potential to replace CCA whenever classification or retrieval
is the purpose and label information is available.

Chaudhuri et al. (2009) exploited CCA to project the data to the subspace spanned by
the means, and then applied standard clustering algorithms to this subspace. This subspace
is valuable for the subsequent clustering because, when projected onto this subspace, the
means of the distributions are well separated, yet the typical distance between points from
the same distribution is smaller than in the original space. Both traditional CCA and
KCCA assume that features across all views are available for all examples, but this may not be
the case with many multi-view datasets. To apply multi-view clustering on such datasets,
Anusua Trivedi (2010) found a way to deal with the lack of data in the incomplete views
with an idea from Laplacian regularization. Given the known part of the kernel matrix $K$, the
missing parts can be found by solving an optimization problem; following construction
of the full kernel, standard algorithms can conduct the subsequent tasks.

In semi-supervised learning, a number of labeled examples are usually required for training
an initial weakly useful predictor, which is in turn used to exploit the unlabeled examples.
By taking advantage of the correlations between the views using CCA, Zhou et al. (2007)
proposed a method which can perform semi-supervised learning with only one labeled training
example. With the help of CCA, the similarity between an original unlabeled instance
and the original labeled instance can be measured. Thus, several unlabeled examples with
the highest and lowest similarity scores can be selected as the extra positive and negative
examples, respectively. Once the number of labeled training examples has been increased in
this way, a traditional semi-supervised learning algorithm can be performed.



Wang et al. (2008) developed a novel multiple kernel learning algorithm combined with
CCA. Initially the input data is mapped into $m$ different feature spaces by $m$ different
kernels, where each generated feature space is taken as one view of the input data. Borrowing
the motivating argument from CCA that $m$ views in the transformed coordinates can be
maximally correlated, the generalization of classifiers can be improved. Combining CCA
with PCA, Zhu et al. (2012) suggested a novel method called MKCCA to implement
dimensionality reduction. MKCCA improves kernel CCA by performing PCA followed by
CCA to better remove noise and handle the issue of trivial learning. Furthermore, comparing
CCA with least squares for regression and classification, Sun et al. (2008) formulated
CCA in multi-label classification as a least squares problem.

7.2 Multi-view Fisher Discriminant Analysis 

In contrast to CCA, which ignores label information, Diethe et al. (2008) generalized Fisher's
discriminant analysis to find informative projections for multi-view data in a supervised
setting.

7.2.1 Two view Fisher Discriminant Analysis

Given examples drawn from two views of the same underlying semantic object, denoted as
$X_a$ and $X_b$ respectively, the two view Fisher discriminant chooses two sets of weights $w_a$
and $w_b$ to solve the following optimization problem

$$\rho = \max_{w_a, w_b} \frac{w_a^\top X_a^\top y y^\top X_b w_b}{\sqrt{\big(w_a^\top X_a^\top B X_a w_a + \mu\|w_a\|^2\big) \cdot \big(w_b^\top X_b^\top B X_b w_b + \mu\|w_b\|^2\big)}},$$

where $w_a$ and $w_b$ are the weight vectors for each view. Since the equation is not affected
by rescaling of $w_a$ or $w_b$, the optimization can be subjected to the following constraints

$$w_a^\top X_a^\top B X_a w_a + \mu\|w_a\|^2 = 1,$$
$$w_b^\top X_b^\top B X_b w_b + \mu\|w_b\|^2 = 1.$$

The corresponding Lagrangian for this optimization can be written as

$$L = w_a^\top X_a^\top y y^\top X_b w_b - \frac{\lambda_a}{2}\big(w_a^\top X_a^\top B X_a w_a + \mu\|w_a\|^2 - 1\big) - \frac{\lambda_b}{2}\big(w_b^\top X_b^\top B X_b w_b + \mu\|w_b\|^2 - 1\big),$$

which can be solved by differentiating with respect to the weight vectors $w_a$ and $w_b$.
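
For concreteness, the differentiation step can be written out as follows. This is a sketch rather than a quotation from Diethe et al. (2008): it assumes that $B$ is symmetric and that the two constraints enter the Lagrangian with multipliers $\lambda_a/2$ and $\lambda_b/2$.

```latex
\frac{\partial L}{\partial w_a}
  = X_a^\top y y^\top X_b w_b - \lambda_a \big(X_a^\top B X_a + \mu I\big) w_a = 0,
\qquad
\frac{\partial L}{\partial w_b}
  = X_b^\top y y^\top X_a w_a - \lambda_b \big(X_b^\top B X_b + \mu I\big) w_b = 0.

% Left-multiplying the first condition by w_a^\top and the second by w_b^\top,
% and using the two unit constraints, gives
\lambda_a = \lambda_b = w_a^\top X_a^\top y y^\top X_b w_b .
```

Under these assumptions the two multipliers coincide with the value of the objective, so the stationarity conditions form a pair of coupled generalized eigenvalue equations in $w_a$ and $w_b$.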

7.2.2 Kernel two view Fisher Discriminant Analysis 

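By introducing two dual weight vectors with $w_a = X_a^\top \alpha$ and $w_b = X_b^\top \beta$, we have

$$\rho = \frac{\alpha^\top X_a X_a^\top y y^\top X_b X_b^\top \beta}{\sqrt{\big(\alpha^\top X_a X_a^\top B X_a X_a^\top \alpha + \kappa\|w_a\|^2\big) \cdot \big(\beta^\top X_b X_b^\top B X_b X_b^\top \beta + \kappa\|w_b\|^2\big)}},$$

and its kernel form

$$\rho = \frac{\alpha^\top K_a y y^\top K_b \beta}{\sqrt{\big(\alpha^\top K_a B K_a \alpha + \kappa\|w_a\|^2\big) \cdot \big(\beta^\top K_b B K_b \beta + \kappa\|w_b\|^2\big)}}.$$

Given the constraints

$$\alpha^\top K_a B K_a \alpha + \kappa\, \alpha^\top K_a \alpha = 1,$$
$$\beta^\top K_b B K_b \beta + \kappa\, \beta^\top K_b \beta = 1,$$

the corresponding Lagrangian for this optimization can be written as

$$L = \alpha^\top K_a y y^\top K_b \beta - \frac{\lambda_a}{2}\big(\alpha^\top K_a B K_a \alpha + \kappa\, \alpha^\top K_a \alpha - 1\big) - \frac{\lambda_b}{2}\big(\beta^\top K_b B K_b \beta + \kappa\, \beta^\top K_b \beta - 1\big).$$

Differentiating with respect to the dual weight vectors $\alpha$ and $\beta$, the above problem can then be
solved.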

7.3 Multi-view Embedding 

Since high dimensionality, i.e. a large number of input features, may lead to a large variance
of estimates, noise, over-fitting and, in general, higher complexity and inefficiency in the
learners, it is necessary to conduct dimensionality reduction and generate low-dimensional
representations for these features. When faced with multiple features, however, performing
a dimensionality reduction for each feature separately is not an ideal solution, considering the
underlying connections between them. Thus it may be necessary to resort to advanced methods to
conduct embedding for multiple features simultaneously and to output a meaningful
low-dimensional embedding shared by all features.

Existing spectral embedding algorithms assume that samples are drawn from a vector
space and thus cannot deal straightforwardly with multi-view data. Xia et al. (2010)
developed a new spectral embedding algorithm, namely multi-view spectral embedding (MSE),
which encodes multi-view features to achieve a physically meaningful embedding. Based
on their previous work on patch alignment (Zhang et al., 2009), MSE can be described as
follows. MSE first builds a patch for a sample on a view; then, given the patches from
different views, part optimization is performed to obtain the optimal low-dimensional
embedding for each view. All low-dimensional embeddings from different patches are then
unified into one whole by global coordinate alignment. More formally, given the $i$-th view
$X^i = [x^i_1, \cdots, x^i_n]$, consider an arbitrary point $x^i_j$ and its $k$ nearest neighbors; the patch is defined
as $X^i_j = [x^i_j, x^i_{j_1}, \cdots, x^i_{j_k}]$. For $X^i_j$, we want to find a part mapping $f^i_j : X^i_j \to Y^i_j$, where
$Y^i_j = [y^i_j, y^i_{j_1}, \cdots, y^i_{j_k}]$. The part optimization for the $j$-th patch on the $i$-th view is defined
as

$$\min_{Y^i_j} \sum_{l=1}^{k} \|y^i_j - y^i_{j_l}\|^2 \,(w^i_j)_l, \tag{43}$$

where $(w^i_j)_l = \exp(-\|x^i_j - x^i_{j_l}\|^2 / t)$. Eq. (43) can be reformulated as

$$\min_{Y^i_j} \operatorname{tr}\big(Y^i_j L^i_j (Y^i_j)^\top\big), \tag{44}$$

where $\operatorname{tr}(\cdot)$ is the trace operator and $L^i_j$ encodes the objective function for the $j$-th patch
on the $i$-th view. To explore the complementary property of multiple views, a set of
non-negative weights $\alpha = [\alpha_1, \cdots, \alpha_m]$ is imposed on the part optimizations, thus the multi-view
part optimization for the $j$-th patch is

$$\min \sum_{i=1}^{m} \alpha_i \operatorname{tr}\big(Y^i_j L^i_j (Y^i_j)^\top\big). \tag{45}$$

To ensure that the low-dimensional embeddings in different views are globally consistent with
one another, assume that the coordinate for $Y^i_j = [y^i_j, y^i_{j_1}, \cdots, y^i_{j_k}]$ is selected from the
global coordinate $Y = [y_1, \cdots, y_n]$, which then gives $Y^i_j = Y S^i_j$, where $S^i_j$ is the selection
matrix encoding the relationships of samples in a patch in the original high-dimensional
space. By summing over all part optimizations, the global coordinate alignment can be
written as

$$\min \sum_{j=1}^{n} \sum_{i=1}^{m} \alpha_i \operatorname{tr}\big(Y S^i_j L^i_j (S^i_j)^\top Y^\top\big). \tag{46}$$

From Eq. (46), the alignment matrix for the $i$-th view can be written as

$$L^i = \sum_{j=1}^{n} S^i_j L^i_j (S^i_j)^\top. \tag{47}$$

To make sure that each view makes a particular contribution to the final low-dimensional
embedding, and subject to some constraints on the variables, the final objective function
is defined as

$$\min \sum_{i=1}^{m} \alpha_i^{\gamma} \operatorname{tr}\big(Y L^i Y^\top\big) \tag{48}$$
$$\text{s.t.} \quad Y Y^\top = I, \quad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \ge 0, \quad \gamma > 1.$$

Finally, MSE can generate a low-dimensional, sufficiently smooth embedding by preserving
the locality of each view simultaneously.
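
To make the role of the exponent $\gamma$ concrete, the sketch below computes the view weights that minimize the final MSE objective when the embedding $Y$ is held fixed. The closed form is derived here from the stated objective and constraints with a Lagrange multiplier; it is not quoted from the source. The function name `update_view_weights` and the input `costs` (standing for the values $\operatorname{tr}(Y L^i Y^\top)$) are illustrative assumptions.

```python
from typing import List


def update_view_weights(costs: List[float], gamma: float) -> List[float]:
    """Minimize sum_i alpha_i**gamma * costs[i] subject to sum_i alpha_i = 1.

    costs[i] stands for tr(Y L^i Y^T) with the embedding Y held fixed.
    Requires gamma > 1 and strictly positive costs.
    """
    if gamma <= 1.0:
        raise ValueError("gamma must be greater than 1")
    if any(c <= 0.0 for c in costs):
        raise ValueError("costs must be strictly positive")
    # Stationarity of the Lagrangian gives
    # alpha_i proportional to (1 / cost_i) ** (1 / (gamma - 1)).
    power = 1.0 / (gamma - 1.0)
    unnormalized = [(1.0 / c) ** power for c in costs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # Views with a smaller alignment cost receive a larger weight.
    print(update_view_weights([1.0, 2.0, 4.0], gamma=2.0))
```

With $\gamma$ close to 1 the weights concentrate on the single best view, while a large $\gamma$ pushes them toward uniform, which is why the constraint $\gamma > 1$ lets every view contribute.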

The main idea of Stochastic Neighbor Embedding (SNE) is to construct probability
distributions from pairwise distances, wherein larger distances correspond to smaller
probabilities and vice versa. Formally, suppose we have high-dimensional data points $\{x_i\}_{i=1}^{n}$; the
joint probability distribution over sample pairs can be represented by a symmetric matrix
$P \in \mathbb{R}_{+}^{n \times n}$, where $p_{ii} = 0$ and $\sum_{i,j} p_{ij} = 1$. Let $y_i$ be the low-dimensional point corresponding
to $x_i$, and let $Q$, with entries $q_{ij}$, denote the probability distribution over pairs in the
low-dimensional embedding. This embedding can be acquired by minimizing the KL divergence of the
two probability distributions,

$$KL(P \,\|\, Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}. \tag{50}$$

Xie et al. (2011) proposed the m-SNE algorithm to generalize SNE to handle multi-view
data by introducing one combination coefficient for each view. The final probability
distribution on the high-dimensional space is then

$$p_{ij} = \sum_{t=1}^{m} \alpha^t p^t_{ij}, \tag{51}$$

where $\alpha^t$ is the combination coefficient for view $t$ and $p^t_{ij}$ is the probability distribution on
view $t$. This combination coefficient plays an important role in utilizing the complementary
information and suppressing noise in multi-view data. Additionally, the original objective
function contains only the KL divergence; a 2-norm regularization term is added to balance the
coefficients over all views,

$$g(\alpha) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}} + \lambda \|\alpha\|^2, \tag{52}$$

where $\lambda$ is the trade-off coefficient.
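
The combination step and the regularized objective can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the low-dimensional distribution `q` is taken as a given input rather than computed from an embedding, and the sum is taken over pairs with `i != j`.

```python
import math
from typing import List

Matrix = List[List[float]]


def combine_views(view_probs: List[Matrix], alpha: List[float]) -> Matrix:
    """Weighted sum of per-view pairwise probabilities: p_ij = sum_t alpha[t] * p_ij^t."""
    n = len(view_probs[0])
    return [
        [sum(a * p[i][j] for a, p in zip(alpha, view_probs)) for j in range(n)]
        for i in range(n)
    ]


def kl_divergence(p: Matrix, q: Matrix) -> float:
    """Sum over i != j of p_ij * log(p_ij / q_ij); zero-probability pairs are skipped."""
    total = 0.0
    for i, row in enumerate(p):
        for j, p_ij in enumerate(row):
            if i != j and p_ij > 0.0:
                total += p_ij * math.log(p_ij / q[i][j])
    return total


def msne_objective(view_probs: List[Matrix], alpha: List[float],
                   q: Matrix, lam: float) -> float:
    """KL(P || Q) for the combined P, plus lam * ||alpha||^2."""
    p = combine_views(view_probs, alpha)
    return kl_divergence(p, q) + lam * sum(a * a for a in alpha)


if __name__ == "__main__":
    p1 = [[0.0, 0.25, 0.25], [0.25, 0.0, 0.0], [0.25, 0.0, 0.0]]
    p2 = [[0.0, 0.0, 0.25], [0.0, 0.0, 0.25], [0.25, 0.25, 0.0]]
    alpha = [0.5, 0.5]
    p = combine_views([p1, p2], alpha)
    # When Q equals the combined P, only the regularizer remains.
    print(msne_objective([p1, p2], alpha, q=p, lam=0.1))
```

Because the coefficients sum to one, the combined matrix is again a valid joint distribution over pairs, so the KL term stays well defined for any choice of weights on the simplex.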

Han et al. (2012) proposed a new framework of sparse unsupervised dimensionality
reduction for multi-view data. Considering the specific statistical property of each view, this
algorithm first learns low-dimensional patterns from these views using the principal
component analysis (PCA) algorithm. After combining the learned low-dimensional pattern
of each view into one unified pattern, the construction of the low-dimensional consensus
representation can be formulated as approximating the matrix of patterns by means of a
low-dimensional consensus base matrix and a loading matrix. To select the most discriminative
features for the spectral embedding of multiple views, a 1-norm penalty is imposed on the columns
of the loading matrix and orthogonality constraints are imposed on the base matrix. A novel method
called Spectral Sparse Multi-View Embedding (SSMVE) was subsequently developed to
efficiently obtain the solution. Furthermore, since each row of the loading matrix is a vector
concatenated from several parts which correspond to the different patterns learned from
different views, a novel structured sparsity-inducing norm penalty was imposed on the rows of the
loading matrix to gain flexibility in sharing information across subsets of the views.
Consequently, another approach for multi-view dimensionality reduction with a structured sparsity
penalty, namely Structured Sparse Multi-View Dimensionality reduction (SSMVD), was
proposed.

7.4 Multi-view Metric Learning 

The goal of metric learning for multi-view data is to construct embedding projections from 
the data in different representations into a shared feature space, so that the Euclidean 
distance in this space is meaningful not only within a single view, but also between different 
views. 

Motivated by cross-media retrieval tasks, Quadrianto and Lampert (2011) studied the
metric learning problem of finding a joint Euclidean distance function that allows nearest
neighbor queries. Following the classical principle of pulling samples together if they are related
and pushing them apart if they are not, multi-view metric learning is formulated as follows.
Suppose there are two sets of $m$ data points, $X = \{x_1, \cdots, x_m\}$ and $Y = \{y_1, \cdots, y_m\}$,
describing the same objects from two different views, and for each $x_i \in X$ there exists a set
$S_{x_i}$ of data points from $Y$ which are similar to $x_i$. Given $\mathcal{X} = \mathbb{R}^{d_1}$ and $\mathcal{Y} = \mathbb{R}^{d_2}$, we seek
the projection functions

$$g_1 : \mathbb{R}^{d_1} \to \mathbb{R}^{D} \quad \text{and} \quad g_2 : \mathbb{R}^{d_2} \to \mathbb{R}^{D},$$

with $D \le \min(d_1, d_2)$, that respect the neighborhood relationship $\{S_{x_i}\}_{i=1}^{m}$. Considering a
linear parameterization of the functions, $g_1(x_i) = \langle w_1, \phi(x_i) \rangle$ and $g_2(y_i) = \langle w_2, \psi(y_i) \rangle$,
the parameters $w_1$ and $w_2$ are the goal of the learning, and the objective function can be written
as

$$L(w_1, w_2, X, Y, S) = \sum_{i,j=1}^{m} L^{i,j}(w_1, w_2, x_i, y_j, S_{x_i}) + \eta\, \Omega(w_1) + \gamma\, \Omega(w_2), \tag{53}$$

where $L^{i,j}(\cdot)$ is the loss function, $\Omega(\cdot)$ is a regularizer on the parameters, and $\eta$ and $\gamma$ are
trade-off variables. By choosing the loss function appropriately, the properties that the projected
data are expected to have can be expressed. In particular, if similar
objects across different views should be mapped to nearby points, whereas dissimilar objects across
different views should be pushed apart, the loss function can be designed as the combination of
two different parts,

$$L^{i,j}(w_1, w_2, x_i, y_j, S_{x_i}) = \frac{I_{[y_j \in S_{x_i}]}}{2} \times L_1^{i,j} + \frac{1 - I_{[y_j \in S_{x_i}]}}{2} \times L_2^{i,j}, \tag{54}$$

where the similarity term $L_1^{i,j}$ forces similar objects to be at proximal locations in the latent
space and the dissimilarity term $L_2^{i,j}$ pushes dissimilar objects away from one another. This
objective function can be decomposed into a difference of two concave functions, thus it can
be solved efficiently by the concave-convex procedure (CCCP).

Since various different low-level visual features can be extracted to comprehensively
represent an image in image processing, it is difficult to choose which feature to depend on to
measure the similarity between images. Thus Yu et al. (2012b) proposed a semi-supervised
multi-view distance metric learning (SSM-DML) algorithm to construct an accurate metric
to precisely measure the dissimilarity between different examples associated with multiple
views. Formally define a matrix $F = [F_1^\top, \cdots, F_N^\top]^\top$, where $F_{ij}$ is the confidence of $x_i$ with
the label $y_j$; this matrix $F$ can then be obtained by minimizing the following objective
function:

$$Q = \sum_{i,j=1}^{N} W_{ij} \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i=1}^{N} \|F_i - Y_i\|^2, \tag{55}$$

where $W$ is an affinity matrix with $W_{ij}$ indicating the similarity measure between $x_i$
and $x_j$, and $D$ is a diagonal matrix with $D_{ii}$ equal to the sum of the $i$-th row of $W$. The
first term in Eq. (55) implies the smoothness of the labels on the graph and the second
term indicates the constraint of the training data. Suppose $X^k$ represents the $k$-th view of
the examples; by linearly combining the graphs constructed from the multi-view feature sets
through the weights $\alpha$, Eq. (55) can be extended to the multi-view feature sets:

$$Q = \sum_{k=1}^{K} \sum_{i,j=1}^{N} \alpha_k W^k_{ij} \left\| \frac{F_i}{\sqrt{D^k_{ii}}} - \frac{F_j}{\sqrt{D^k_{jj}}} \right\|^2 + \mu \sum_{i=1}^{N} \|F_i - Y_i\|^2 + \lambda \|\alpha\|^2 \tag{56}$$
$$\text{s.t.} \quad \sum_{k=1}^{K} \alpha_k = 1.$$

Then, by adopting alternating optimization to solve the above optimization problem,
SSM-DML can learn the multi-view distance metrics from multiple feature sets and the
labels of unlabeled data simultaneously.
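
As an illustration of what the objective measures, the sketch below evaluates the graph-smoothness term, the fitting term and the weight regularizer for given inputs. It only evaluates the objective; the alternating optimization itself is not shown. The function names are hypothetical, and each affinity matrix is assumed to be symmetric with non-negative entries.

```python
import math
from typing import List

Matrix = List[List[float]]


def smoothness(W: Matrix, F: Matrix) -> float:
    """sum_ij W_ij * || F_i / sqrt(D_ii) - F_j / sqrt(D_jj) ||^2."""
    degree = [sum(row) for row in W]  # D_ii is the i-th row sum of W
    total = 0.0
    for i, row in enumerate(W):
        for j, w_ij in enumerate(row):
            if w_ij == 0.0:
                continue  # also skips isolated nodes when W is symmetric
            total += w_ij * sum(
                (fi / math.sqrt(degree[i]) - fj / math.sqrt(degree[j])) ** 2
                for fi, fj in zip(F[i], F[j])
            )
    return total


def fitting(F: Matrix, Y: Matrix) -> float:
    """sum_i || F_i - Y_i ||^2, the agreement with the given labels."""
    return sum((f - y) ** 2 for Fi, Yi in zip(F, Y) for f, y in zip(Fi, Yi))


def multiview_objective(Ws: List[Matrix], alpha: List[float], F: Matrix,
                        Y: Matrix, mu: float, lam: float) -> float:
    """Weighted graph smoothness over views + mu * fitting + lam * ||alpha||^2."""
    graph_term = sum(a * smoothness(W, F) for a, W in zip(alpha, Ws))
    return graph_term + mu * fitting(F, Y) + lam * sum(a * a for a in alpha)


if __name__ == "__main__":
    W = [[0.0, 1.0], [1.0, 0.0]]          # two connected samples
    same = [[1.0, 0.0], [1.0, 0.0]]       # both samples share a label
    print(smoothness(W, same))            # connected samples agree, so no penalty
```

A view whose graph connects samples with disagreeing labels contributes a large smoothness term, so minimizing over the weights shifts mass toward views whose graphs are consistent with the current labels.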

Zhai et al. (2012) also studied the multi-view metric learning problem in the
semi-supervised learning setting, and proposed a new method called Multi-view Metric Learning
with Global consistency and Local smoothness (MVML-GL), which jointly considers global
consistency and local smoothness. This algorithm is accomplished in two steps: (1) seek
a shared latent feature space to establish the relationship between data from multi-view
observation spaces according to pairs of labeled instances; (2) learn the relationships
between the input space of each observation and the shared latent space for unlabeled and
test data. It is worth noting that the first step is globally consistent, as it simultaneously
considers the geometric structures contained in each view and the connections between the data
from different views, and the second step is locally smooth, which enables each instance to
have its own specific distance metric instead of applying a uniform metric to all instances.
Additionally, both steps can be formulated as convex optimization problems with closed-form
solutions, thus they can be solved efficiently.

7.5 Latent Space Models

Besides the aforementioned methods, which aim to conduct meaningful dimensionality
reduction for multi-view data, there are also works that concentrate on analyzing the relationships
between different views. These methods are used to build latent space models, with which
multiple views can be connected with one another through latent variables, and
information can be propagated from one view to another.

7.5.1 Shared Gaussian Process Latent Variable Model 

Gaussian processes (GPs) are powerful models for classification and regression that subsume
numerous classes of function approximators, such as single hidden-layer neural networks and
RBF networks. Lawrence (2004) first proposed the Gaussian process latent variable model
(GPLVM) as a new technique for non-linear dimensionality reduction. Shon et al. (2006)
proposed the shared GPLVM (SGPLVM) as a generalization of the GPLVM model that
can handle multiple observation spaces, where each set of observations is parameterized by
a different set of kernel parameters.

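Let $Y, Z$ be matrices of observations drawn from spaces of dimensionality $D_Y, D_Z$
respectively, and $X$ be a latent space of dimensionality $D_X \ll D_Y, D_Z$. Assume that each
latent point $x_i$ generates a pair of observations $y_i, z_i$ via GP-parameterized non-linear
functions $f_Y : X \to Y$ and $f_Z : X \to Z$. By using an exponential (RBF) kernel to define the
similarity between two data points $x, x'$,

$$k(x, x') = \alpha_Y \exp\Big(-\frac{\gamma_Y}{2}\|x - x'\|^2\Big) + \delta_{x,x'}\, \beta_Y^{-1}, \tag{57}$$

the priors $P(\theta_Y), P(\theta_Z), P(X)$ (with $\theta = \{\alpha, \beta, \gamma\}$) and the likelihoods $P(Y), P(Z)$ for the $Y, Z$
observation spaces are given by

$$P(Y \mid \theta_Y, X) = \frac{|W|}{\sqrt{(2\pi)^{N D_Y} |K|^{D_Y}}} \exp\Big(-\frac{1}{2} \sum_{k=1}^{D_Y} w_k^2\, Y_k^\top K^{-1} Y_k\Big), \tag{58}$$

$$P(Z \mid \theta_Z, X) = \frac{|W|}{\sqrt{(2\pi)^{N D_Z} |K|^{D_Z}}} \exp\Big(-\frac{1}{2} \sum_{k=1}^{D_Z} w_k^2\, Z_k^\top K^{-1} Z_k\Big), \tag{59}$$

$$P(\theta_Y) \propto \frac{1}{\alpha_Y \beta_Y \gamma_Y}, \qquad P(\theta_Z) \propto \frac{1}{\alpha_Z \beta_Z \gamma_Z}, \tag{60}$$

$$P(X) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2} \sum_{i} \|x_i\|^2\Big); \tag{61}$$

then the joint likelihood can be written as

$$P_{GP}(X, Y, Z, \theta_Y, \theta_Z) = P(Y \mid \theta_Y, X)\, P(Z \mid \theta_Z, X)\, P(\theta_Y)\, P(\theta_Z)\, P(X). \tag{62}$$

By using a conjugate gradient solver to maximize Eq. (62), the model can learn a separate
kernel for each observation space and a single set of common latent points.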
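
The kernel of Eq. (57) is simple to compute directly. The sketch below builds the kernel matrix for one observation space over a set of latent points. It assumes the usual reading of the kernel, with a factor $\gamma/2$ inside the exponential and the noise term $\beta^{-1}$ added only when the two arguments are the same latent point; the function names are illustrative.

```python
import math
from typing import List, Sequence


def rbf_kernel(x: Sequence[float], x_prime: Sequence[float], alpha: float,
               beta: float, gamma: float, same_point: bool) -> float:
    """alpha * exp(-gamma / 2 * ||x - x'||^2), plus 1 / beta when x and x' coincide."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    noise = 1.0 / beta if same_point else 0.0
    return alpha * math.exp(-0.5 * gamma * sq_dist) + noise


def kernel_matrix(X: List[Sequence[float]], alpha: float, beta: float,
                  gamma: float) -> List[List[float]]:
    """Kernel matrix over the latent points; the delta term acts on the diagonal."""
    n = len(X)
    return [
        [rbf_kernel(X[i], X[j], alpha, beta, gamma, i == j) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    latent = [[0.0], [1.0], [3.0]]
    for row in kernel_matrix(latent, alpha=2.0, beta=4.0, gamma=1.0):
        print(row)
```

In the shared model the same latent points would be passed to this function twice, once with the hyperparameters of each observation space, which is what allows a separate kernel per space over a common latent representation.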

Given a trained SGPLVM, we would like to infer the parameters in one observation
space given the parameters in the other observation space. This problem can be solved in
two steps. First, we determine the most likely latent coordinate $x$ given the observation $y$
using $\arg\max_x L_X(x, y)$. Once the correct latent coordinate $x$ has been inferred for a given $y$,
the model uses the trained SGPLVM to predict the corresponding observation $z$.

7.5.2 Shared Kernel Information Embedding



Given samples drawn from a distribution $p(x)$, Kernel Information Embedding (KIE) (Memisevic,
2006) aims to find a low-dimensional latent distribution, $p(z)$, that captures the structure of
the data, along with explicit bidirectional probabilistic mappings between the latent space
and the data space. In particular, KIE finds the joint distribution $p(x, z)$ that maximizes
the mutual information between the latent distribution and the data distribution:

$$I(x, z) = \int p(x, z) \log \frac{p(x, z)}{p(x)\, p(z)} \, dx \, dz \tag{63}$$
$$\phantom{I(x, z)} = H(x) + H(z) - H(x, z),$$

where $H(\cdot)$ is the usual Shannon entropy, which can be estimated by kernel density estimation.

The shared KIE (sKIE) (Sigal et al., 2009; Memisevic et al., 2012), which can be seen
as an extension of KIE, constructs the joint embedding for two views by maximizing the
mutual information $I((x, y), z)$. Assuming the conditional independence of $x$ and $y$ given
$z$, $I((x, y), z)$ can be expressed as a sum of two mutual information terms,

$$I((x, y), z) = I(x, z) + I(y, z), \tag{64}$$

where $I(x, z)$ and $I(y, z)$ can be formulated as in KIE.
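
The entropy decomposition of the mutual information can be checked numerically on a small discrete example. This is a simplified stand-in: KIE works with continuous densities estimated by kernel density estimation, whereas the sketch below uses an explicit joint probability table, and its function names are illustrative.

```python
import math
from typing import Dict, Hashable, Tuple

Joint = Dict[Tuple[Hashable, Hashable], float]


def entropy(dist: Dict[Hashable, float]) -> float:
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def marginals(joint: Joint):
    """Marginal distributions of the first and second variable."""
    px: Dict[Hashable, float] = {}
    pz: Dict[Hashable, float] = {}
    for (x, z), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        pz[z] = pz.get(z, 0.0) + p
    return px, pz


def mutual_information(joint: Joint) -> float:
    """I(x, z) = H(x) + H(z) - H(x, z)."""
    px, pz = marginals(joint)
    return entropy(px) + entropy(pz) - entropy(joint)


if __name__ == "__main__":
    independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    coupled = {(0, 0): 0.5, (1, 1): 0.5}
    print(mutual_information(independent))  # the latent tells nothing about x
    print(mutual_information(coupled))      # the latent determines x
```

The two toy tables bracket the behaviour that the embedding objective exploits: a latent variable that is independent of the data carries no information, while one that determines the data attains the maximum for a binary variable.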

An application of this algorithm is human pose inference. For discriminative pose
inference, the aim is to find likely poses $y$ conditioned on input image features $x^*$. The
conditional pose distribution is then

$$p(y \mid x^*) = \int p(y \mid z)\, p(z \mid x^*)\, dz. \tag{65}$$

Alternatively, the focus can be on identifying the principal modes of $p(y \mid x^*)$. To this end,
it is assumed that the principal modes of $p(y \mid x^*)$ coincide with the principal modes of the
conditional latent distribution $p(z \mid x^*)$. That is, a search is first conducted for local maxima
of $p(z \mid x^*)$, denoted $\{z^*_k\}_{k=1}^{K}$ for $K$ modes. From these latent points it is straightforward to
perform either MAP inference or take the expectation over the conditional pose distributions
$p(y \mid z^*_k)$.

7.5.3 Factorized Orthogonal Latent Space 

Both sGPLVM and sKIE only consider the shared information in the views of the data but
ignore the private part of each view. Salzmann et al. (2010) proposed a robust approach
called FOLS to factorize the latent space into shared and private spaces by introducing
orthogonality constraints, which penalize redundant latent representations.

For minimal factorization, the shared and private latent spaces are required to be
non-redundant; in other words, it is desirable to penalize the redundancy of different private
spaces and thus encourage the representation of common information in the shared space.
More formally, define $Y^i = [y^i_1, \cdots, y^i_N]$ as the set of observations from a single view $i$, with
$1 \le i \le V$. Additionally, let $X = [x_1, \cdots, x_N]^\top$ be the latent space shared across different
views, $Z^i = [z^i_1, \cdots, z^i_N]^\top$ be the private space for the $i$-th view, and $M^i = [m^i_1, \cdots, m^i_N]^\top$ be
the joint shared-private latent space for each view, with $m^i_j = [x_j, z^i_j]$. By imposing the
above-mentioned non-redundancy constraint as a soft penalty, a FOLS model can be learned
by minimizing

$$\hat{L} = L + \underbrace{\alpha \sum_{i} \Big( \|X^\top Z^i\|_F^2 + \sum_{j > i} \|(Z^i)^\top Z^j\|_F^2 \Big)}_{\text{orthogonality}} + \underbrace{\gamma \sum_{i} \phi(s_i)}_{\text{low dimensionality}} + \underbrace{\eta \sum_{i} \Big( E_0^i - \sum_{j} s_{i,j}^2 \Big)^2}_{\text{energy conservation}}, \tag{66}$$

where $s_i$ are the singular values of $M^i$, $E_0^i$ is the energy of stream $i$, and $L$ is the loss
function of the particular model into which the factorization constraints are introduced.
In the sGPLVM and sKIE models, $L$ represents the square loss, or the negative mutual
information between each joint latent space and its corresponding data stream.
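
The orthogonality penalty is the easiest term of the objective to illustrate. The sketch below evaluates it for a shared latent matrix and a list of private latent matrices stored as lists of rows, one row per sample. It assumes the penalty takes the form $\sum_i (\|X^\top Z^i\|_F^2 + \sum_{j>i} \|(Z^i)^\top Z^j\|_F^2)$, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def cross_gram_sq_fro(A: Matrix, B: Matrix) -> float:
    """Squared Frobenius norm of A^T B, for A (N x p) and B (N x q) stored by rows."""
    n = len(A)
    total = 0.0
    for a_col in range(len(A[0])):
        for b_col in range(len(B[0])):
            dot = sum(A[r][a_col] * B[r][b_col] for r in range(n))
            total += dot * dot
    return total


def orthogonality_penalty(X: Matrix, Zs: List[Matrix]) -> float:
    """Penalize overlap between shared and private spaces, and among private spaces."""
    penalty = 0.0
    for i, Zi in enumerate(Zs):
        penalty += cross_gram_sq_fro(X, Zi)      # shared vs. private space i
        for Zj in Zs[i + 1:]:
            penalty += cross_gram_sq_fro(Zi, Zj)  # private space i vs. private space j
    return penalty


if __name__ == "__main__":
    X = [[1.0], [0.0], [0.0]]
    Z1 = [[0.0], [1.0], [0.0]]
    Z2 = [[0.0], [0.0], [1.0]]
    print(orthogonality_penalty(X, [Z1, Z2]))  # mutually orthogonal latent directions
    print(orthogonality_penalty(X, [Z1, X]))   # a private space that duplicates X
```

The penalty vanishes exactly when every private direction is orthogonal to the shared space and to the other private spaces, which is the non-redundancy property the factorization is meant to encourage.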

7.5.4 Factorized Latent Spaces with Structured Sparsity 

Inspired by sparse coding techniques, Jia et al. (2010) proposed a novel approach to finding
a latent space in which the information is correctly factorized into shared and private
parts, while avoiding the computational burden of previous techniques. In particular, this
algorithm represents each view as a linear combination of view-dependent dictionary entries.
While the dictionaries are specific to each view, the weights of these dictionaries act as latent
variables and are the same for all the views.

More formally, to find a shared-private factorization of the latent embedding $\alpha$ that
represents the multiple input modalities, the algorithm adopts the idea of structured sparsity
and aims to find a set of dictionaries $D = \{D^1, \cdots, D^V\}$. This problem can be formulated
as

$$\min_{D, \alpha} \frac{1}{N} \sum_{v=1}^{V} \|X^v - D^v \alpha\|_F^2 + \lambda \sum_{v=1}^{V} \psi\big((D^v)^\top\big) + \gamma\, \psi(\alpha), \tag{67}$$

where the first term measures the loss, the second term encourages each view to use only
a limited number of latent dimensions, and the third term is a relaxation of rank
constraints that discovers the dimensionality of the latent space.

At inference, given a new observation $\{x^1_*, \cdots, x^V_*\}$, the corresponding latent embedding
$\alpha_*$ can be obtained by solving the convex problem

$$\min_{\alpha_*} \sum_{v=1}^{V} \|x^v_* - D^v \alpha_*\|_2^2 + \gamma \|\alpha_*\|_1, \tag{68}$$

where the regularizer allows us to deal with noise in the observations.
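
The inference problem is a sparse least-squares problem over the stacked views, so any standard solver for such problems applies. The sketch below uses plain iterative soft-thresholding (ISTA), which is a choice made here for illustration and not necessarily the solver used by Jia et al. It assumes the objective $\sum_v \|x^v - D^v a\|_2^2 + \gamma \|a\|_1$ with dictionaries stored as lists of rows; the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def soft_threshold(value: float, threshold: float) -> float:
    """Proximal operator of threshold * |.|: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def infer_latent(xs: List[List[float]], dictionaries: List[Matrix],
                 gamma: float, iterations: int = 500) -> List[float]:
    """ISTA for min_a sum_v ||x^v - D^v a||_2^2 + gamma * ||a||_1."""
    # Stacking the views turns the sum over v into one least-squares term.
    rows = [row for D in dictionaries for row in D]
    target = [value for x in xs for value in x]
    k = len(rows[0])
    # 2 * ||D||_F^2 upper-bounds the Lipschitz constant of the gradient.
    lipschitz = 2.0 * sum(d * d for row in rows for d in row)
    if lipschitz == 0.0:
        return [0.0] * k  # all dictionaries are zero, so a = 0 is optimal
    step = 1.0 / lipschitz
    a = [0.0] * k
    for _ in range(iterations):
        residual = [sum(r * ai for r, ai in zip(row, a)) - t
                    for row, t in zip(rows, target)]
        grad = [2.0 * sum(row[c] * res for row, res in zip(rows, residual))
                for c in range(k)]
        a = [soft_threshold(ai - step * g, step * gamma)
             for ai, g in zip(a, grad)]
    return a


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # With an identity dictionary the solution is a soft-thresholded copy of x.
    print(infer_latent([[3.0, 0.5]], [identity], gamma=2.0))
```

The small second coordinate is driven exactly to zero, which illustrates how the 1-norm term suppresses weak, possibly noisy components of the observation.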

7.5.5 Latent Space Markov Network



Chen et al. (2010) constructed a predictive subspace shared by multi-view data based on
the generic multi-view latent space Markov network (MN), under the assumption that the
data from different views and the response variables are conditionally independent given a
set of latent variables.

The two-view latent space Markov network consists of two views of input data $X := \{X_n\}$
and $Z := \{Z_m\}$ and a set of latent variables $H := \{H_k\}$. According to random field theory, the
marginal distributions for the two views can be written in the exponential forms

$$p(x) = \exp\Big\{\sum_{i} \theta_i^\top \phi(x_i, x_{i+1}) - A(\theta)\Big\}, \tag{69}$$

$$p(z) = \exp\Big\{\sum_{j} \eta_j^\top \psi(z_j, z_{j+1}) - B(\eta)\Big\}, \tag{70}$$

where $\phi$ and $\psi$ are feature functions, and $A$ and $B$ are log-partition functions. For the latent
variables, the marginal distribution is

$$p(h) = \prod_{k} \exp\big\{\lambda_k^\top \varphi(h_k) - C_k(\lambda_k)\big\}, \tag{71}$$

where $\varphi(h_k)$ is the feature vector of $h_k$ and $C_k$ is the log-partition function. By combining the
above components in the log-domain, the joint model distribution is defined as

$$p(x, z, h) \propto \exp\Big\{\sum_{i} \theta_i^\top \phi(x_i, x_{i+1}) + \sum_{j} \eta_j^\top \psi(z_j, z_{j+1}) + \sum_{k} \lambda_k^\top \varphi(h_k)$$
$$\qquad\qquad + \sum_{i,k} \phi(x_i, x_{i+1})^\top W_i^k \varphi(h_k) + \sum_{j,k} \psi(z_j, z_{j+1})^\top U_j^k \varphi(h_k)\Big\}.$$

Additionally, considering that each input sample is associated with a supervised response
variable $y \in \{1, \cdots, T\}$, we can define

$$p(y \mid h) = \frac{\exp\{V^\top f(h, y)\}}{\sum_{y'} \exp\{V^\top f(h, y')\}},$$

where $f(h, y)$ is the feature vector whose elements from $(y - 1)K + 1$ to $yK$ are those of
$h$ and all others are 0. Accordingly, $V$ is a stacked parameter vector of $T$ sub-vectors $V_y$,
each of which corresponds to a class label $y$.
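
The stacked feature vector and the resulting class posterior can be sketched as follows. This is an illustrative sketch with hypothetical function names: it assumes class labels run from 1 to $T$ and that the latent representation `h` is already available as a plain list of numbers.

```python
import math
from typing import List


def stacked_feature(h: List[float], y: int, num_classes: int) -> List[float]:
    """f(h, y): a copy of h in block y (labels are 1-based), zeros elsewhere."""
    k = len(h)
    f = [0.0] * (k * num_classes)
    f[(y - 1) * k: y * k] = h
    return f


def class_posterior(h: List[float], V: List[float], num_classes: int) -> List[float]:
    """p(y | h) proportional to exp(V^T f(h, y)), for y = 1, ..., num_classes."""
    scores = [
        sum(v * f for v, f in zip(V, stacked_feature(h, y, num_classes)))
        for y in range(1, num_classes + 1)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    h = [2.0, 5.0]
    V = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # only the class-1 sub-vector is non-zero
    print(stacked_feature(h, 2, 3))
    print(class_posterior(h, V, 3))
```

Because $f(h, y)$ is zero outside block $y$, the score $V^\top f(h, y)$ reduces to the inner product of $h$ with the sub-vector $V_y$, so the model is an ordinary multi-class softmax on the latent representation.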

Although this multi-view latent space MN can be learned by maximum likelihood
estimation (MLE), Chen et al. (2010) estimated the decision boundary directly with a large-margin
approach. Assume the discriminant function $F(y, h; V)$ is linear, that is, $F(y, h; V) =
V^\top f(h, y)$, which resembles the discriminant function $W^\top X$ in SVM. Then the objective
function is

$$\min\ L(\Theta) + \frac{1}{2} C_1 \|V\|^2 + C_2 R_{\text{hinge}}(V), \tag{73}$$

where the first term $L(\Theta) = -\sum_{d} \log p(x_d, z_d)$ is the negative data log-likelihood, the second
term constrains the decision boundary, and the third term, the hinge loss, plays the role of the slack
variable $\xi$ in SVM. Since Eq. (73) maximizes the data likelihood and minimizes the training
loss, it can be expected that by solving this problem we can find a predictive latent space
representation $p(h \mid x, z)$ and a prediction model parameter $V$ at the same time.

8. Applications

In general, by exploiting the consistency and complementarity of multiple views, learning models
from multi-view data will lead to an improvement in learning performance. Thus multi-view
learning has been applied successfully in a number of real-world applications.

Since Blum and Mitchell (1998) first proposed the co-training algorithm and applied it
to the web document classification problem, this novel method has caught the attention
of many researchers and has been widely applied in the field of natural language processing
(Craven et al., 2000; Müller et al., 2002; Phillips and Riloff, 2002). Pierce and Cardie
(2001) studied the learning behavior of co-training and showed that, given a small set of
labeled training data and a large set of unlabeled data, co-training can reduce the difference
in error between co-trained classifiers and fully supervised classifiers trained on a labeled
version of all available data by 36%. Unlike previous efforts which cope with the task of word
sense disambiguation in a supervised way, Mihalcea (2004) suggested combining co-training
with majority voting, with the effect of smoothing the learning curves to improve average
performance. Maeireizo et al. (2004) investigated the applicability of co-training to train
classifiers that predict emotions in spoken dialogues on features pre-processed in a wrapper
approach with forward selection. Kiritchenko and Matwin (2001), Kockelkorn et al.
(2003) and Scheffer (2004) treated the email classification problem in the framework of
semi-supervised learning, so that the cost of labeling unlabeled data could be eliminated,
and employed a co-training method to significantly improve learning performance. Besides
these applications involving text or natural language processing, co-training has also found
application in the field of computer vision. For instance, Liu and Yuen (2011) studied the
human action recognition problem and introduced two new confidence measures, i.e. inter-view
confidence and intra-view confidence, to address view sufficiency and view dependency
issues in co-training. Christoudias et al. (2009a) designed a probabilistic heteroscedastic
approach to co-training, which discovers the amount of noise while solving multi-view object
recognition tasks. Feng and Chua (2003) and Feng et al. (2004) addressed the image
annotation problem by combining co-training with active learning. Thus the requirement for
a large labeled training corpus for effective learning is relaxed by co-training, and the best
examples are selected for labeling at each stage to maximize the learning objective by active
learning. Considering various kinds of visual features, such as color and texture features,
as sufficient and uncorrelated views of an image, Zhou et al. (2004) and Cheng and Wang
(2007) introduced a co-training algorithm to conduct relevance feedback in content-based
image retrieval.

As for multiple ker nel learning, iKumar and Sminchisescul ( 20071 ): iLin et al.l ( 20071 ) and 
Varma and Rayl (120071 ) applied it to object classification by linearly combining similarity 
functions between ima ges so that the combined sim ilarity function yields improved classi- 
fication performance. iLongworth and GalesI ( 20081 ) employed multiple kernel learning for 
object detection with the goal of finding an optimal combination of exponential x^ kernels, 
each of which wou ld capture a diffe r ent fe ature channel, such as the distribution of edges, 
and visual words. Kembhavi et al. (2009) proposed an incremental multiple kernel learning
approach for object recognition. In this case, "incremental" means that the images of
objects in poses more commonly observed in the scene, as well as the kernel weights, are
updated in each iteration, thus further improving the learning performance.

Subspace learning is an important tool for analyzing the relationships between different
views of the data and has a number of applications. Donner et al. (2006) introduced a fast
active appearance model search algorithm based on CCA. Zheng et al. (2006) used KCCA to
solve the facial expression recognition problem. Dhillon et al. (2011) computed the CCA
between different views of the data to estimate low-dimensional context-specific word
representations from unlabeled data in NLP tasks. Fu et al. (2008) effectively solved the
face recognition task by constructing a linear subspace in which the cumulative canonical
correlation between any pair of feature sets is maximized. Zhang et al. (2012) studied the
hyperspectral remote sensing image classification problem from the perspective of
multi-view learning, and introduced the patch alignment framework to linearly combine
multiple features in an optimal way and obtain a unified low-dimensional representation of
these features for subsequent classification. Considering that the key issue in cartoon
character retrieval is a proper representation that describes the cartoon character
effectively, Yu et al. (2012a) introduced a semi-supervised multi-view subspace learning
algorithm which encodes different features in a unified space, as illustrated in Figure 5.
In this unified subspace, the Euclidean distance can be used directly to measure the
distance between two cartoon characters. To improve the performance of ranking and
difficulty estimation in image retrieval, Li et al. (2011) applied multi-view embedding
(ME) to images represented by multiple features, integrating them into a joint subspace
that preserves the neighborhood information in each feature space, as illustrated in
Figure 6. To address the out-of-sample problem and the high computational cost, a linear
multi-view embedding algorithm was developed which learns a linear transformation from a
small set of data and can efficiently infer the subspace features of new data.
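Several of the applications above rest on CCA between two views. As a rough illustration
only, the following sketch computes linear CCA with NumPy by whitening each view and taking
an SVD of the whitened cross-covariance. It is not the implementation used in any of the
cited works; the function name `cca`, the ridge term `reg`, and the synthetic two-view data
are assumptions made for this example.

```python
import numpy as np


def cca(X, Y, n_components=1, reg=1e-6):
    """Linear CCA between two views X (n, dx) and Y (n, dy).

    Returns projection matrices Wx, Wy and the canonical correlations.
    `reg` is a small ridge term (an assumption, added for numerical stability).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Regularized covariance and cross-covariance estimates.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    # Whiten each view with its Cholesky factor, then take an SVD of the
    # whitened cross-covariance; the singular values are the correlations.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(T)

    # Map the singular vectors back to the original feature spaces.
    Wx = np.linalg.solve(Lx.T, U[:, :n_components])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return Wx, Wy, s[:n_components]


# Synthetic two-view data sharing one latent variable z (hypothetical example).
rng = np.random.default_rng(0)
z = rng.normal(size=500)
X = np.column_stack([z + 0.1 * rng.normal(size=500), rng.normal(size=500)])
Y = np.column_stack([rng.normal(size=500), -z + 0.1 * rng.normal(size=500)])

Wx, Wy, rho = cca(X, Y)
print("top canonical correlation:", rho[0])
```

Projecting each view with `X @ Wx` and `Y @ Wy` gives the shared low-dimensional
representation; because `Wx` and `Wy` are plain linear maps, they can also be applied to
samples that were not available when the projections were estimated.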



9. Performance Evaluation 

In this section, we introduce some widely used datasets in multi-view learning experiments
and make an empirical comparison of several representative multi-view learning algorithms
with single-view learning algorithms.

Data Sets for Multi-view Learning. So far, several datasets have been widely
employed in multi-view learning experiments. Here we give a brief introduction to these
datasets.

• The WebKB dataset¹ is the most famous dataset used in multi-view learning, on which
the co-training algorithm was first evaluated. This dataset consists of 8282 academic
web pages collected from the computer science department web sites of four universities:
Cornell, University of Washington, University of Wisconsin, and University of Texas.
These pages can be grouped into six classes: student, staff, faculty, department, course
and project. There are two views: the text on the page itself and the anchor text of the
hyperlinks pointing to the page.

• The Citeseer dataset² is a collection of scientific publications which contains 3312
documents belonging to six classes. There are three natural views for each document:
the text view consists of the title and abstract of the paper; the two link views are
inbound and outbound references.

1. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

2. http://komarix.org/ac/ds/

Figure 5: Flowchart of the semi-supervised multi-view subspace learning algorithm
(Yu et al., 2012a). The method first extracts multi-view features from cartoon
characters. Then, by considering the constraints of each local patch and the
complementary characteristics of multi-view features, the low-dimensional
representation Y can be obtained by solving an alternating optimization problem.
Finally, cartoon character retrieval and clip synthesis can be conducted by
measuring the dissimilarity in the subspace Y.

• Some popular data sets from the UCI repository³ are suitable for evaluating
multi-view learning. For example, the internet advertisement dataset contains images
from various web pages that are characterized either as advertisements or
non-advertisements. The instances are described in terms of six views, which are the
geometry of the images, the base url, the image url, the target url, the anchor text
and the alt text.

• There are also a number of other multimedia datasets usually employed in experiments
on image annotation, image classification and image retrieval, which include the
TRECVID2003 video dataset⁴, Caltech256⁵, etc. Different visual features, such as color
histogram, edge direction histogram, and wavelet texture, are extracted to represent
multiple views of the data.
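As a hedged illustration of how such feature-based views might be built from a single
image, the sketch below computes a color histogram and an edge direction histogram with
NumPy. The function names, the bin counts, and the random stand-in image are assumptions
made for this example; wavelet texture is omitted to keep the sketch short, and none of
this reproduces the exact features used in the cited experiments.

```python
import numpy as np


def color_histogram(img, bins=8):
    """Per-channel intensity histogram of an RGB image with values in [0, 255]."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()


def edge_direction_histogram(img, bins=8):
    """Histogram of gradient orientations, weighted by gradient magnitude."""
    gray = img.astype(float).mean(axis=2)
    gy, gx = np.gradient(gray)      # derivatives along rows, then along columns
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)      # orientation in [-pi, pi]
    h, _ = np.histogram(angle, bins=bins, range=(-np.pi, np.pi),
                        weights=magnitude)
    total = h.sum()
    return h / total if total > 0 else h


# Hypothetical input: a random 32x32 RGB image standing in for a real one.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3))

# Each entry is one view of the same image.
views = {
    "color": color_histogram(image),
    "edge": edge_direction_histogram(image),
}
for name, feature in views.items():
    print(name, feature.shape)
```

Keeping the views in separate arrays, rather than concatenating them, is what allows a
multi-view method to treat each feature type with its own classifier, kernel, or
projection.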



3. http://archive.ics.uci.edu/ml/

4. http://www-nlpir.nist.gov/projects/tv2003/

5. http://www.vision.caltech.edu/Image_Datasets/Caltech256/

Figure 6: Application of linear multi-view embedding in difficulty-guided image retrieval
(Li et al., 2011).



Empirical Evaluation. To illustrate the benefits of multi-view learning methods compared
to traditional single-view learning, Table 1 presents a list of results drawn from several
published multi-view learning papers. Blum and Mitchell (1998); Nigam and Ghani (2000);
Brefeld and Scheffer (2004); Sindhwani et al. (2005); Yu et al. (2011) and Zhu et al. (2012)
used the WebKB data as one of their evaluation datasets. Because the researchers applied
different preprocessing steps, it is difficult to compare the proposed methods directly;
we therefore denote their versions of the data as WebKB₁, ..., WebKB₆ respectively and
show in the table how each proposed multi-view learning method compares with single-view
learning methods.



Table 1: Comparison between multi-view learning and single-view learning methods. Each
block lists one experiment; the column headers name the method and state whether it is
single-view or multi-view.

WebKB₁ (Blum and Mitchell, 1998). Metric: error rate.

| Data | naive Bayes (single-view) | Co-trained NB (multi-view) |
|---|---|---|
| Page | 12.9% | 6.2% |
| Hyperlink | 12.4% | 11.6% |
| Page+Hyperlink | 11.1% | 5.0% |

WebKB₂ (Nigam and Ghani, 2000). Metric: error rate.

| Data | naive Bayes (single-view) | Co-trained NB (multi-view) | Co-EM NB (multi-view) |
|---|---|---|---|
| Page+Hyperlink | 13.0% | 5.4% | 4.3% |

WebKB₃ (Brefeld and Scheffer, 2004). Metric: error rate.

| Data | naive Bayes (single-view) | SVM (single-view) | Co-EM NB (multi-view) | Co-EM SVM (multi-view) |
|---|---|---|---|---|
| Page+Hyperlink | 13.0% | 10.39% | 5.08% | 0.99% |

WebKB₄ (Sindhwani et al., 2005). Metric: [label missing in source].

| Data | SVM (single-view) | RLS (single-view) | Co-LapSVM (multi-view) | Co-LapRLS (multi-view) |
|---|---|---|---|---|
| Page | [illegible] | 71.6% | 93.3% | 92.0% |
| Hyperlink | 74.4% | 72.0% | 94.3% | 94.4% |
| Page+Hyperlink | 84.4% | 78.3% | 94.2% | 93.6% |

WebKB₅ (Yu et al., 2011). Metric: [label missing in source].

| Data | GPLR (single-view) | Co-trained GPLR (multi-view) | Bayesian Co-training (multi-view) |
|---|---|---|---|
| Page+Hyperlink | [garbled: 0.5*7%] | 0.56% | 0.58% |

WebKB₆ (Zhu et al., 2012). Metric: AUC.

| Data | KPCA (single-view) | KCCA (multi-view) | MKCCA (multi-view) |
|---|---|---|---|
| Page+Hyperlink | 94.5% | 86.6% | 94.6% |

UCI₁ (Gönen and Alpaydin, 2008). Metric: Acc.

| Data | SVM, kernel subscript illegible (single-view) | MKL Kp-Kg (multi-view) | LMKL Kp-Kg (multi-view) |
|---|---|---|---|
| Banana | 56.51% | 81.99% | 83.84% |
| Heart | 72.78% | 75.78% | 79.44% |
| Ionosphere | 91.54% | 93.68% | 93.33% |
| Pima | 66.95% | [garbled: 9S.S6%] | 98.69% |
| Sonar | 65.29% | [garbled: SO.29%] | [missing] |

UCI₂ (Varma and Babu, 2009). Metric: Acc.

| Data | LP-SVM (single-view) | MKL (multi-view) | GMKL (multi-view) |
|---|---|---|---|
| Ionosphere | 93.0% | [garbled: S7.7%] | 94.1% |
| Parkinsons | 86.2% | 84.7% | 92.6% |
| Musk | 81.5% | 87.0% | 93.3% |
| Sonar | 73.7% | 79.5% | 82.0% |
| Wpbc | 76.2% | 69.4% | 78.3% |

UCI₃ (Rakotomamonjy et al., 2008). Metric: Acc (Time (s)).

| Data | SILP (multi-view) | Simple MKL (multi-view) |
|---|---|---|
| Liver | 65.9% (47.6) | 65.9% (18.9) |
| Pima | 76.5% (224) | 76.5% (79.0) |
| Ionosphere | 91.7% (535) | 91.5% (123) |
| Wpbc | 76.8% (88.6) | 76.7% (20.6) |
| Sonar | 80.5% (2290) | 80.6% (163) |

UCI₄ (Xu et al., 2010). Metric: Acc (Time (s)).

| Data | Simple MKL (multi-view) | MKLGL (multi-view) |
|---|---|---|
| Ionosphere | 91.5% (79.9) | 92.0% (12.0) |
| Breast | 96.5% (110.5) | 96.6% (14.1) |
| Sonar | 82.0% (57.0) | 82.0% (5.7) |
| Pima | 73.4% (94.5) | 73.5% (15.1) |

On the WebKB₁ data, Blum and Mitchell (1998) evaluated the co-training algorithm and
compared its performance with that of the single-view learning algorithm naive Bayes.
On WebKB₂, Nigam and Ghani (2000) evaluated the proposed co-EM method. Brefeld and
Scheffer (2004) developed a novel co-EM based on SVM and showed its satisfactory
performance compared to single-view SVM and co-trained naive Bayes on WebKB₃.
Sindhwani et al. (2005) evaluated their proposed co-regularization method on WebKB₄, and
compared it to the single-view regularization method, single-view SVM and co-trained
Laplace SVM. Yu et al. (2011) gave a graphical-model view of the co-training algorithm,
developed Bayesian co-training, and performed experiments on WebKB₅. On WebKB₆,
Zhu et al. (2012) compared the performances of multi-view approaches and single-view
approaches in respect of subspace learning.
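To make the co-training procedure behind several of these experiments more concrete, here
is a minimal sketch in plain Python. It is written in the spirit of Blum and Mitchell
(1998) but is not their implementation: the one-dimensional Gaussian classifier with a
shared variance, the helper names (`fit_gnb`, `posterior_pos`, `co_train`, `predict`), the
number of rounds, and the synthetic two-view data are all assumptions chosen for brevity.

```python
import math
import random


def fit_gnb(xs, ys):
    """1-D Gaussian naive Bayes with a variance shared by both classes."""
    means, counts = {}, {}
    for c in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == c]
        means[c] = sum(vals) / len(vals)
        counts[c] = len(vals)
    # Pooled variance plus a small floor so tiny labeled sets stay well behaved.
    var = sum((x - means[y]) ** 2 for x, y in zip(xs, ys)) / len(xs) + 0.1
    return {c: (means[c], var, counts[c] / len(xs)) for c in (0, 1)}


def posterior_pos(model, x):
    """P(y = 1 | x) under the fitted model."""
    def log_joint(c):
        mean, var, prior = model[c]
        return math.log(prior) - (x - mean) ** 2 / (2 * var)
    l0, l1 = log_joint(0), log_joint(1)
    m = max(l0, l1)
    return math.exp(l1 - m) / (math.exp(l0 - m) + math.exp(l1 - m))


def co_train(view1, view2, labels, labeled_idx, rounds=20, per_class=1):
    """Grow a shared labeled pool with each view's most confident predictions.

    `labels` is only read at positions in `labeled_idx`, which must contain
    both classes; every other example can only receive a pseudo-label.
    """
    y = {i: labels[i] for i in labeled_idx}
    unlabeled = [i for i in range(len(view1)) if i not in y]
    for _ in range(rounds):
        for view in (view1, view2):
            if not unlabeled:
                break
            idx = sorted(y)
            model = fit_gnb([view[i] for i in idx], [y[i] for i in idx])
            scored = sorted(unlabeled, key=lambda i: posterior_pos(model, view[i]))
            # Most confident negatives are at the front, positives at the back.
            picks = ([(i, 0) for i in scored[:per_class]]
                     + [(i, 1) for i in scored[-per_class:]])
            for i, c in picks:
                if i in unlabeled:  # guards against picking one example twice
                    y[i] = c
                    unlabeled.remove(i)
    idx = sorted(y)
    return (fit_gnb([view1[i] for i in idx], [y[i] for i in idx]),
            fit_gnb([view2[i] for i in idx], [y[i] for i in idx]))


def predict(models, x1, x2):
    """Combine the two view-specific posteriors by multiplying them."""
    p1, p2 = posterior_pos(models[0], x1), posterior_pos(models[1], x2)
    return int(p1 * p2 >= (1 - p1) * (1 - p2))


random.seed(0)
n = 200
labels = [i % 2 for i in range(n)]
# Two synthetic views, conditionally independent given the class; the class
# means are -2 and +2 in both views (hypothetical data for illustration).
view1 = [random.gauss(4 * c - 2, 1.0) for c in labels]
view2 = [random.gauss(4 * c - 2, 1.0) for c in labels]

models = co_train(view1, view2, labels, labeled_idx=[0, 1, 2, 3])
accuracy = sum(predict(models, a, b) == c
               for a, b, c in zip(view1, view2, labels)) / n
print("accuracy with 4 labeled examples:", accuracy)
```

The point of the sketch is the data flow: each view's classifier labels the examples it is
most sure about, and those pseudo-labels then become training data for the other view as
well. The reported accuracy is measured over all examples, most of which never had their
true label revealed to the learner.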

Gönen and Alpaydin (2008); Varma and Babu (2009); Rakotomamonjy et al. (2008) and
Xu et al. (2010) used benchmark datasets from the UCI machine learning repository.
We thus use UCI₁, ..., UCI₄ to denote the respective experiments of these works.
In these experiments, several representative multiple kernel learning methods, such as
localized MKL and simple MKL, were evaluated in terms of accuracy and time cost. From
these comparison results, we observe that multi-view learning methods designed
appropriately for real-world applications can indeed improve performance significantly
compared to single-view learning methods.
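The size of these gains can be made concrete with a small calculation. The snippet below
takes the error rates that Table 1 lists for the combined Page+Hyperlink data in the first
three WebKB experiments and computes the relative error reduction of the best multi-view
method over the best single-view method. The dictionary layout and variable names are
illustrative assumptions; only the percentages come from the table.

```python
# Error rates (in %) on the combined Page+Hyperlink data, as listed in Table 1:
# (best single-view method, best multi-view method).
error_rates = {
    "WebKB1": (11.1, 5.0),    # naive Bayes vs. co-trained NB
    "WebKB2": (13.0, 4.3),    # naive Bayes vs. co-EM NB
    "WebKB3": (10.39, 0.99),  # SVM vs. co-EM SVM
}

# Relative reduction = (single - multi) / single.
relative_reduction = {
    name: (single - multi) / single
    for name, (single, multi) in error_rates.items()
}

for name, r in relative_reduction.items():
    print(f"{name}: {100 * r:.1f}% fewer errors")
# WebKB1: 55.0% fewer errors
# WebKB2: 66.9% fewer errors
# WebKB3: 90.5% fewer errors
```

Because the three experiments use different preprocessing, these figures are comparable
only within each experiment, not across them.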



10. Conclusions 

In many scenarios, more than one view can be provided to describe the data. Instead of
selecting one view from the corpus or simply concatenating them for learning, we are more
interested in algorithms that can learn models from multi-view data by considering the
diversity of different views. In this survey paper, we have therefore reviewed several
current trends of multi-view learning and classified these algorithms into three different
settings: co-training, multiple kernel learning, and subspace learning. Through analyzing
these different approaches to the integration of multiple views, we observe that they
mainly depend on either the consensus principle or the complementary principle to ensure
their success. Furthermore, we also studied the problems with respect to how to construct
multiple views and how to evaluate these views. The experimental results show the
extensive development of multi-view learning and its promising performance compared to
single-view learning.

Although significant work has been carried out in this field, several important research 
issues need to be addressed in the future. Since the properties of different views largely 
influence the performance of multi-view learning, it is necessary to place more emphasis on 
methods to construct, analyze and evaluate the views. Each of the three groups of
multi-view learning algorithms has its own advantages, but the groups have mainly been
designed and developed separately. It would therefore be valuable to develop a general
framework of multi-view learning which combines the merits of the different methods.

We conclude that multi-view learning is effective and promising in practice, but it has
not been well addressed to date. There is still much work to be done to better process
multi-view data in a wide variety of applications.